Abstract

Motivated by authentication, intrusion and spam detection applications we consider 
single-class classification (SCC) as a two-person game between the learner and an adver- 
sary. In this game the learner has a sample from a target distribution and the goal is to 
construct a classifier capable of distinguishing observations from the target distribution 
from observations emitted from an unknown other distribution. The ideal SCC classifier 
must guarantee a given tolerance for the false-positive error (false alarm rate) while mini- 
mizing the false negative error (intruder pass rate). Viewing SCC as a two-person zero-sum 
game we identify both deterministic and randomized optimal classification strategies for 
different game variants. We demonstrate that randomized classification can provide a sig- 
nificant advantage. In the deterministic setting we show how to reduce SCC to two-class 
classification where in the two-class problem the other class is a synthetically generated 
distribution. We provide an efficient and practical algorithm for constructing and solv- 
ing the two class problem. The algorithm distinguishes low density regions of the target 
distribution and is shown to be consistent. 



1. Introduction 



In Single-Class Classification (SCC) the learner observes a training set of sampled instances
from one target distribution. The goal is to create a classifier that can distinguish instances 
emitted from distributions other than the target distribution and unknown to the learner 
during training. This SCC problem can model many applications such as intrusion, fault 
and novelty detection. For example, in an instance of an intrusion detection problem (see 
e.g., Nisenson, Yariv, El-Yaniv, & Meir, 2003), the goal is to create a classifier that can 
distinguish 'legal' users from intruders based on behaviometric or biometric patterns. This 
classifier can then be used to guard against illegal attempts to gain access into protected 
systems or regions. 

Single-class classification (also termed one-class classification) has received considerable
research attention in the machine learning and pattern recognition communities. For
example, the survey papers alone (Markou & Singh, 2003a, 2003b; Hodge & Austin, 2004)
cite, altogether, over 100 SCC papers. Most SCC works implicitly assume that a good
solution can be achieved by identifying low density regions of the target distribution, so
that the objective is to reject sub-domains of low density. Thus, the main consideration
in previous SCC studies has been statistical: how can a prescribed false positive rate be
guaranteed given a finite sample from the target distribution?

The proposed approaches are typically generative or discriminative. Generative solu- 
tions range from full density estimation (Bishop, 1994), to partial density estimation such 
as quantile estimation (G. Lanckriet, Ghaoui, & Jordan, 2002), level set estimation
(Ben-David & Lindenbaum, 1995; Steinwart, Hush, & Scovel, 2005) or local density estimation
(Breunig, Kriegel, Ng, & Sander, 2000). In discriminative methods one attempts to gen- 
erate a decision boundary appropriately enclosing the high density regions of the training 
set (Yu, 2005). In addition to such constructions, there are many empirical studies of the 
proposed solutions. Nevertheless, it appears that the area suffers from a lack of theoretical 
contributions and principled (empirical) comparative studies of the proposed solutions. 

Motivated mainly by intrusion detection applications, in this paper we examine the SCC 
problem from an adversarial viewpoint where an adversary selects the attacking distribution. 
We begin by abstracting away the statistical estimation component of the problem by 
considering a setting where the learner has a very large sample from the target distribution. 
This setting is modeled by assuming that the learning algorithm has precise knowledge 
of the target distribution. While this assumption would render almost the entire body of 
SCC literature superfluous, it turns out that a significant and non-trivial decision-theoretic 
component of the adversarial SCC problem remains, one that has so far been overlooked.
For a discrete version of the SCC problem we provide an in-depth analysis of adversarial SCC
and identify optimal strategies for variants of the problem depending on whether or not the 
learner can play a randomized strategy and on various constraints on the adversary. As a 
consequence of this analysis, it can be demonstrated that a randomized learner strategy can 
be superior on average to standard deterministic classification. For an infinitely continuous 
version of this game we provide a simple and consistent SCC algorithm that implements the 
standard low-density rejection by reducing the SCC problem to two-class soft classification. 

The body of this paper contains the principal results that are simpler to present. The
appendices contain some of the more technical proofs of the presented results. An earlier
version of this work containing a subset of the results was presented at NIPS (El-Yaniv &
Nisenson, 2006). Extensions to this work can be found in the thesis of Nisenson (2010).

2. Problem Formulation

We define the adversarial single-class classification (SCC) problem as a two-person zero-sum
game between the learner and an adversary. The learner receives a training sample
of examples from a target distribution $P$ defined over some space $\Omega$. On the basis of this
training sample, the learner should select a rejection function $r : \Omega \to [0, 1]$, where for each
$\omega \in \Omega$, $r(\omega)$ is the probability with which the learner will reject $\omega$. On the basis of any
knowledge of $P$ and/or $r(\cdot)$, the adversary selects an attacking distribution $Q$, defined over
$\Omega$. Then, a new example is drawn from $\gamma P + (1 - \gamma)Q$, where $0 < \gamma < 1$ is a switching
probability unknown to the learner.

The rejection rate of the learner, using a rejection function $r$, with respect to any
distribution $D$ (over $\Omega$), is $\rho(r, D) = \mathbf{E}_D\{r(\omega)\}$. The two main quantities of interest here
are the false positive rate (type I error) $\rho(r, P)$, and the false negative rate (type II error)
$1 - \rho(r, Q)$. Before the start of the game, the learner receives a tolerance parameter $0 < \delta < 1$,
giving the maximally allowed false positive rate. A rejection function $r(\cdot)$ is valid if its
false positive rate satisfies the constraint $\rho(r, P) \le \delta$. A valid rejection function (strategy)
is optimal if it guarantees the smallest false negative rate amongst all valid strategies.
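
For a finite outcome space, the quantities defined above reduce to weighted sums. The following is a minimal sketch under that assumption; the function names and the three-outcome distributions are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def rejection_rate(r: Sequence[float], dist: Sequence[float]) -> float:
    """rho(r, D) = E_D[r(omega)] when D is a probability mass function."""
    return sum(ri * di for ri, di in zip(r, dist))


def is_valid(r: Sequence[float], P: Sequence[float], delta: float,
             tol: float = 1e-12) -> bool:
    """A rejection function is valid if its false positive rate is at most delta."""
    return rejection_rate(r, P) <= delta + tol


def false_negative_rate(r: Sequence[float], Q: Sequence[float]) -> float:
    """Type II error: probability that an example drawn from Q is accepted."""
    return 1.0 - rejection_rate(r, Q)


# Hypothetical target and attacking distributions over three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
r = [0.0, 0.0, 0.5]  # soft strategy: reject the third outcome half of the time

print(rejection_rate(r, P), is_valid(r, P, delta=0.1), false_negative_rate(r, Q))
```

Here the soft strategy spends the whole tolerance on the least probable outcome under the hypothetical $P$.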

This setting conveniently models various SCC applications and, in particular, intrusion
detection problems. For example, considering biometric authentication, the false alarm
rate $\rho(r, P)$ is the rejection (failed authentication) rate of the legal users and $\rho(r, Q)$ is the
rejection rate of intruders, which should be maximized.

Remark 1. Clearly, a dual SCC problem can be formulated where a sufficiently high 
intruder rejection rate must be guaranteed and the false alarm rate should be minimized. 
We briefly discuss this dual problem and its relation to the "primal" in Section 8. Other 
types of SCC problems can be considered where the loss is a function of the type I and type
II errors. For example, one may be interested in minimizing a convex combination of these
errors. Any such loss function can be handled using our definition and searching for the $\delta$
for which the SCC solution optimizes the desired loss function.

Our analysis begins by focusing on the Bayes decision theoretic version of the SCC
problem in which the learner knows the target distribution $P$ precisely. The problem is
thus viewed as a two-person zero sum game where the payoff to the learner is $\rho(r, Q)$. The
set $\mathcal{R}_\delta(P) = \{r : \rho(r, P) \le \delta\}$ of valid rejection functions is the learner's strategy space. We
denote by $\mathcal{Q}$ the strategy space of the adversary, consisting of all allowable distributions
$Q$ that can be selected by the adversary.^1

1. The game can be expressed in 'extensive form' (i.e., a game tree) where in the first move the learner
selects a rejection function, followed by a chance move to determine the source (either $P$ or $Q$) of the
test example (with probability $\gamma$). In the case where $Q$ is selected, the adversary chooses (randomly
using $Q$) the test example. In this game the choice of $Q$ depends on knowledge of $P$ and $r(\cdot)$.

We are concerned with optimal learner strategies for game variants distinguished by the
adversary's knowledge of the learner's strategy, of $P$ and/or of $\delta$, and by other limitations on $\mathcal{Q}$.
We also distinguish a special type of this game, which we call the hard setting, in which the
learner is constrained to employ only deterministic rejection functions; that is, $r : \Omega \to \{0, 1\}$,
and such rejection functions are termed "hard." The more general game defined above
(with "soft" functions) is called the soft setting. As far as we know, only the hard setting 
has been considered in the SCC literature thus far. The reason for considering soft rejection 
functions is that they can achieve significant advantage in terms of type II error reduction. 
Later on in Section 6.2.1 we numerically demonstrate such error reductions. 

For any rejection function, the learner can reduce the type II error by rejecting more
(i.e., by increasing $r(\cdot)$). Therefore, in the soft setting for an optimal $r(\cdot)$ we must have
$\rho(r, P) = \delta$ (rather than $\rho(r, P) < \delta$). It follows that the switching parameter $\gamma$ is immaterial
to the selection of an optimal strategy.

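Given an adversary strategy space, $\mathcal{Q}$, we define the set $\mathcal{R}^*_\delta(P)$ of optimal valid rejection
functions as $\mathcal{R}^*_\delta = \{r \in \mathcal{R}_\delta(P) : \min_{Q \in \mathcal{Q}} \rho(r, Q) = \max_{r' \in \mathcal{R}_\delta(P)} \min_{Q' \in \mathcal{Q}} \rho(r', Q')\}$.^2 We note
that $\mathcal{R}^*_\delta$ is never empty in the cases we consider.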

3. Related Work 

One-Class Classification is often given different names, depending on the desired use. For 
example, other common names include outlier detection, fault detection and novelty de- 
tection. Historically, one of the earliest works is due to Grubbs (1969) who considered 
in-sample outlier detection. Grubbs calculates a cut-off statistic for determining outliers in 
the 1-dimensional Gaussian case at the 5%, 2.5% and 1% significance levels within samples 
of various sizes. Minter (1975) appears to be the first to use the term "single-class
classification". Minter starts from a fairly standard two-class approach, assuming that there is a
class of interest (class 1) and a class of "others" (class 0). Given the switching parameter
$\gamma$ (which is the a priori probability of class 1), Minter gives the rule to accept a point $x$ iff
$\gamma \Pr\{x|1\} > (1 - \gamma) \Pr\{x|0\}$, which, since $\Pr\{x\} = \gamma \Pr\{x|1\} + (1 - \gamma) \Pr\{x|0\}$, is
equivalent to $\gamma \Pr\{x|1\} > \frac{1}{2} \Pr\{x\}$. It is assumed that
both $\gamma$ and $\Pr\{x\}$ are known or can be estimated from historical data, leaving the problem
of estimating $\Pr\{x|1\}$ from the given sample. While, technically, only a sample from the
class of interest is given, the additional assumptions make this a modified form of a two-class
problem.^3 These are the earliest explicit works we have found. Note that statisticians have

2. For certain strategy spaces, $\mathcal{Q}$, it may be necessary to consider the infimum rather than the minimum.
In such cases it may be necessary to replace '$Q \in \mathcal{Q}$' (in definitions, theorems, etc.) with '$Q \in \mathrm{cl}(\mathcal{Q})$',
where $\mathrm{cl}(\mathcal{Q})$ is the closure of $\mathcal{Q}$.

3. This differs from more recent works, where $\gamma$ and $\Pr\{x\}$ are assumed to be unknown (whereby the
learner's knowledge is much more restricted), and the type I error is required to not exceed a bound, $\delta$,
which is the setting we use in this work.






Foundations of Adversarial Single-Class Classification 



long been considering the two-sample problem, which is similar but perhaps simpler. One
can view the SCC problem as an extremely unbalanced instance of the two-sample problem
that prevents using the standard statistical hypothesis testing techniques.

Since virtually all prior works on SCC that we have encountered deal with how to
approximate a low-density rejection strategy given a set, $\{x_1, \ldots, x_n\}$, of training points,
sampled from the class of interest, we will focus our review here on such methods.

We begin with discussing support, quantile and level-set estimation. Support estimation
aims to estimate the support of a density $p$. In terms of outlier detection, the goal is
clear: a point falling outside the estimated support is taken to be an outlier. One of the
simpler methods, analyzed by Devroye and Wise (1980), is to estimate the support as
$S_n = \bigcup_{i=1}^{n} B(x_i, \epsilon_n)$, where $B(x, a)$ is a closed ball centered at $x$ with radius $a$
(i.e., $\{y : \|y - x\| \le a\}$, for some norm $\|\cdot\|$), and $\epsilon_n$ is a (vanishing) sequence of smoothing
parameters. In quantile estimation, the goal is to find a set $U(\beta)$ such that
$\lambda(U(\beta)) = \inf_S \{\lambda(S) : P(S) \ge \beta\}$,
where $\lambda$ is a real valued function. For our purposes, we take $\lambda$ as the Lebesgue measure,
in which case the problem is also called minimum volume estimation. When $\beta = 1$ this
becomes support estimation, and when $\beta = 1 - \delta$ this problem is the same as low-density
rejection. In level-set estimation, the goal is to approximate the set $C(t) = \{x : p(x) > t\}$
(or alternatively $\{x : p(x) \ge t\}$). Of course, level-set estimation can be used for support
estimation by taking $t = 0$ or by taking $t = t_n$ as a sequence which approaches zero
(see Cuevas & Fraiman, 1997). Clearly, level-set estimation approximates the low-density
rejection strategy when $P(C(t)) = 1 - \delta$. A significant amount of prior SCC work has
focused on minimum volume and level-set estimation. We distinguish between explicit and
implicit methods: explicit methods try to directly solve one of these problems, whereas
implicit methods use a heuristic which may or may not give the desired result. We
note that whether the method is explicit or implicit is not necessarily an indicator of whether 
the underlying model is generative or discriminative, although there is a clear tendency for 
explicit methods to be generative. Transformations from the one-class setting to the two- 
class setting tend to be implicit and discriminative. We will consider minimum volume 
estimation approaches first and then look at various level-set estimation results. Finally we 
will examine other results, including transformations to the two-class setting. 
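
The union-of-balls support estimate attributed above to Devroye and Wise (1980) amounts to a nearest-sample distance test. The sketch below assumes the Euclidean norm and a fixed, hand-picked radius in place of a vanishing sequence; the sample points and function name are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def in_support_estimate(x: Point, sample: Sequence[Point], eps: float) -> bool:
    """True if x lies in the union of closed Euclidean balls B(x_i, eps)."""
    return any(math.dist(x, xi) <= eps for xi in sample)


# Hypothetical training sample; eps plays the role of the smoothing parameter.
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

print(in_support_estimate((0.1, 0.1), sample, eps=0.25))  # near a training point
print(in_support_estimate((2.0, 2.0), sample, eps=0.25))  # far from all of them
```

A point is flagged as an outlier exactly when the test returns False.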

Minimum volume estimation has been a favored approach to solving the SCC problem
in the literature. This is perhaps due to two works which reused the popular Support Vector
Machine (SVM, see Vapnik, 1998) from two-class classification problems. The earlier work
(D. Tax & Duin, 1999) sought to fit the sample data inside a sphere of minimal radius,
a solution they called the Support Vector Data Description (SVDD). Specifically, given a
sphere with center $a$ and radius $R$, the error function to be minimized is $R^2 + C \sum_i \xi_i$, under
the constraints $(x_i - a)^T (x_i - a) \le R^2 + \xi_i$ and $\xi_i \ge 0$, where $C$ is a regularization term which relates to
the type I error. Outliers in the sample data would lie on, or outside the sphere (and have
$\xi_i > 0$). The kernel trick was then employed to allow for solving the problem in a higher
dimensional feature space. They note that polynomial kernels do not result in small volumes 
in the input space, as points distant from the origin tend to have high error values. They 
found that Gaussian kernels worked well. The type I error can be estimated from the number 
of support vectors divided by the sample size, n, where the support vectors are the points 
lying on the sphere (i.e. they define the sphere's boundary). Changing the regularization 
parameter C, or the bandwidth parameter of the Gaussian kernel, can be used to control the 
trade-off between the volume of the sphere and the number of support vectors. In a follow 
up work, D. M. J. Tax and Duin (2001) show how samples from a uniform distribution 
can be used to optimize for both parameters simultaneously. The second work (Schölkopf,
Platt, Shawe-Taylor, Smola, & Williamson, 2001) introduced what is commonly called the
One-Class Support Vector Machine (OC-SVM). The technique used is that of a standard
two-class SVM where the second class is the origin (in feature space). In other words, a
hyper-plane is sought which maximizes the soft-margin between the origin and the sample
points. Points lying on the "wrong" side of the hyper-plane are outliers. The kernel trick
can also be employed for OC-SVM. Schölkopf et al. show that for kernels $k(x, y)$ that depend
only on $x - y$, such as the Gaussian kernel, the solutions found by OC-SVM and SVDD
are identical. They further showed that the value $\nu = \frac{1}{nC}$, where $C$ is the regularization
parameter in the SVM equation, is an upper bound on the fraction of outliers, a lower
bound on the fraction of support vectors, and that for probability measures $P$ without
discrete components, asymptotically the fractions of outliers and of support vectors are equal,
in probability. Vert and Vert (2006) correctly point out that while OC-SVM can guarantee
the type I error, no guarantees are made regarding consistency of the result (i.e., whether 
the result converges to a region of minimum volume) . This same point is valid for SVDD as 
well. Indeed, the poor performance of SVDD using polynomial kernels is sufficient proof that 
the minimum volume set (in the original feature space) is not found. Thus, both of these 
approaches are implicit, as they do not explicitly solve for the minimum volume set. Similar 
results for the Minimax Probability Machine (where the type I error is bounded but the 
resulting set does not necessarily have the minimum volume) are provided by Lanckriet et
al. (G. R. G. Lanckriet, Ghaoui, Bhattacharyya, & Jordan, 2002; G. Lanckriet et al., 2002).
C. D. Scott and Nowak (2006) overcome these limitations: they use Empirical Risk
Minimization to prove consistency (in a distribution free manner) and establish convergence rates
using Structural Risk Minimization for trees (these results aren't distribution free;
specifically there is a requirement which can be satisfied if $p$ has no plateaus). C. Scott
(2007) expands on this analysis, which served as the basis for the 2-class SVM approach used
in (Davenport, Baraniuk, & Scott, 2006), where the second class is the uniform distribution.
The results significantly outperformed those of OC-SVM (i.e. a significantly smaller volume
was found for approximately the same type I error).

We now turn our attention to level-set estimation. Let $C_n(t)$ be the estimate of $C(t)$
given the $n$ sample points. One of the most common error measures is $\lambda(C(t) \Delta C_n(t))$, where
$\lambda$ is the Lebesgue measure and $\Delta$ is the symmetric difference (i.e., $A \Delta B = (A \setminus B) \cup (B \setminus A)$).
Another common measure is $H_P(C(t), t) - H_P(C_n(t), t)$, where $H_P(S, t) = P(S) - t\lambda(S)$ is
the excess mass of $S$. Both of these measures are non-negative and equal to zero at the
optimal solution. Much of the prior work which explicitly solves the level-set estimation
problem shows consistency by proving that as $n$ goes to infinity, one of these two measures
goes to zero. Most recent work focuses on calculating convergence rates under various
conditions on the density $p$. One of the most common techniques for level-set estimation
is the plug-in estimate where $C_n(t) = \{x : p_n(x) > t\}$, for a density estimate $p_n$ of $p$. The
kernel density estimate (Parzen, 1962) is most often used. For a thorough analysis of the
plug-in estimate (in terms of consistency and convergence rates) see Cuevas and Fraiman
(1997); Cadre (2006); Rigollet and Vert (2008). Interestingly, the SCC community appears
to have been inclined to pursue alternate and novel approaches over the straight-forward 
use of the kernel density estimate as part of the plug-in estimator. It must be stressed 
that these approaches have largely been implicit, in the sense that they are based on either 
a heuristic or some other approximation, and consistency is not proven. For example, 
Breunig et al. (2000) develop a measure they call the Local Outlier Factor (LOF). LOF is 
calculated based on a smoothed k-nearest-neighbor distance, where the LOF is calculated 
as an average ratio of these distances between the neighbors of a point and the point itself. 
In other words, the LOF is calculated so that objects "deep within a cluster" will have a 
LOF of approximately 1, while objects near edges of clusters or far from other points will 
have large values. This seems to be a heuristic way of estimating $f(p)$ where $f$ is hoped
to be a monotonically decreasing function. Hempstalk, Frank, and Witten (2008) use the
plug-in estimate approach where they use a rather different way of establishing $p_n$. Using
Minter's notation from above, they generate an artificial distribution for class 0, and then
it follows from Bayes' Theorem that:

$$\Pr\{x|1\} = \frac{\Pr\{0\} \, \Pr\{1|x\}}{\Pr\{1\} \, \Pr\{0|x\}} \Pr\{x|0\}.$$

Since the artificial distribution is known, and the prior can be controlled, $\Pr\{x|1\}$ can be
estimated from $\Pr\{1|x\}$, which is estimated using class-probability estimation techniques,
specifically bagged trees with Laplacian smoothing. In practice, they use a density estimate
of $p$ to establish the density for the artificial set. While the technique is certainly interesting,
it would be of great interest to see if consistency or convergence rates could be proven. Vert
and Vert (2006) demonstrated that one need not estimate the density directly in order
to determine the level-set. They prove that an SVM, with a convex loss function and
Gaussian kernel with a "well-calibrated bandwidth $\sigma$," can produce an estimate $C_n(t)$, such
that $\lim_{n \to \infty} H_P(C(t), t) - H_P(C_n(t), t) = 0$, in probability. Steinwart, Hush, and Scovel
(2004) provide convergence rates when using L1-SVM for the error measure $\mu(C(t) \Delta C_n(t))$,
where $\mu$ is a reference probability distribution.
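
The plug-in level-set estimate with a kernel density estimate can be sketched in a few lines. This is an illustration only, assuming one-dimensional data, a Gaussian kernel with a hand-picked bandwidth, and a threshold chosen empirically so that roughly a prescribed fraction of the training points falls below it; none of these choices comes from the works cited above, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kde(x: float, sample: Sequence[float], h: float) -> float:
    """Gaussian kernel density estimate p_n(x) with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm


def threshold_for_delta(sample: Sequence[float], h: float, delta: float) -> float:
    """Pick t as the (k+1)-th smallest training density, k = floor(delta * n)."""
    dens = sorted(kde(x, sample, h) for x in sample)
    k = int(math.floor(delta * len(sample)))
    return dens[min(k, len(dens) - 1)]


def accept(x: float, sample: Sequence[float], h: float, t: float) -> bool:
    """Accept x if its estimated density reaches the threshold (>= keeps the
    threshold point itself, so about k training points are rejected)."""
    return kde(x, sample, h) >= t


# Hypothetical sample: a tight cluster around 0 plus one isolated point at 3.
sample = [0.0, 0.1, -0.1, 0.2, -0.2, 0.05, -0.05, 0.15, -0.15, 3.0]
h = 0.3
t = threshold_for_delta(sample, h, delta=0.1)

print(accept(0.0, sample, h, t), accept(3.0, sample, h, t))
```

On this sample the isolated training point is the only one rejected, which matches the intended low-density rejection behaviour at a tolerance of one in ten.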

Finally, we consider other works, starting with transformations to the two-class setting. 
All of these approaches rely on the creation of a second class in the vicinity of the tar- 
get class. Examples of this are (Banhalmi, Kocsor, & Busa-Fekete, 2007) where SVM is 
used to separate between the two classes, and (Curry & Heywood, 2009), where genetic 
programming is used and the fitness function accounts for overlap between the two classes. 
Other works, such as (Rätsch, Mika, Schölkopf, & Müller, 2002), look at how boosting can
be applied in the one-class setting. A recent and interesting work is by Juszczak, Tax, 
Pekalska, and Duin (2009), which uses the premise that the target class should largely be 
continuous; in other words, if two points belong to the target class, there should be a path 
from one to the other. For points which are very close to each other, we may expect this to 
be a straight line. They propose building a minimum spanning tree covering the data, and 
test membership to the target class by testing the distance of a point to the tree. Since the 
continuity assumption may be violated for points in different clusters, they allow for the 
removal of edges in the tree, where longer edges are better candidates for removal. They 
also allow for a form of dimensionality reduction by removing the shortest paths in the tree. 
The approach has very good performance on the tested data sets, and it would be of great 
interest to see if the authors can develop consistency or other theoretical results for it. 

4. An Informal Look - an Investment/ROI Analogy 

To gain some insight into the one-class classification setting, we now describe an analogous
investment game. The learner is given an amount of money to invest, $\delta$. There are assets
which can be invested in, with a cost of $p_i$ to invest in asset $i$. For each asset $i$, the learner
purchases an amount $r(i) \in [0, 1]$ (i.e., from none to all of an asset) and then sells it at a
price $q_i$, determined by the adversary. Any monies not invested are lost. Since the initial
wealth is $\delta$, the allocation strategy $r(\cdot)$ must satisfy $\sum_i r(i) p_i \le \delta$. The overall return to be
maximized is $\sum_i r(i) q_i$.

Clearly, the Return-On-Investment (ROI) for asset $i$ is $\frac{q_i}{p_i}$, and thus the learner should
invest in assets which have the highest ROI (where free assets are taken to have infinite
ROI). In the SCC setting, the fact that the learner must select the investment strategy, $r(\cdot)$,
before the adversary determines the selling prices, clearly makes this a difficult proposition.
Had we reversed the order, and the adversary were to determine the selling prices first, we
would have a two-class classification problem (i.e., the learner, with full knowledge of both
classes, is to minimize type II error subject to a maximum type I error). In this case, the
learner's optimal investment strategy would be clear:

The learner shouldn't invest in an asset k, unless all assets with a higher ROI 
than k have already been purchased. 

Note that while this strategy applies to the soft setting ($r(i) \in [0, 1]$), the optimal solution
is very nearly identical to that of the hard solution ($r(i) \in \{0, 1\}$), with the only difference
being that any left over money is invested. How does this investment strategy translate from
the two-class classification setting to our original one-class classification setting, where the
learner must invest without knowing the ROI values? Clearly, if the adversary's strategy
space has some inherent constraints on the relative ROI of assets, then the learner could
take advantage of them. For example, in the simplest case, if the adversary's strategy
space enforces an ordering on the ROI values, for example $j < k \Rightarrow \frac{q_j}{p_j} < \frac{q_k}{p_k}$, then the
learner can invest optimally without knowing $Q$. However, the less the adversary's strategy
space constrains the relative ROI of assets, the more difficult the learner's task is. We
would intuitively expect that, in the face of an adversary determined to minimize the
learner's return, fewer constraints on the adversary would force the learner to diversify
his investment. In the extreme case of no constraints at all on the adversary, the learner
should purchase the same amount of every non-free asset.^4 We also note that the more the
learner diversifies, the "further" his investment strategy becomes from the optimal
two-class strategy (in accordance with known ROI values).
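
The two extremes of the investment analogy can be sketched as follows. The first function is the greedy rule for the case where the selling prices are known (highest ROI first, spending any leftover money on a fractional purchase); the second buys the same fraction of every non-free asset, as suggested for an unconstrained adversary. Function names and the example prices are hypothetical, and the code is an illustration of the analogy rather than an algorithm from the paper.

```python
from typing import List, Sequence


def greedy_two_class(p: Sequence[float], q: Sequence[float], delta: float) -> List[float]:
    """Known selling prices q: buy assets in decreasing ROI q_i / p_i."""
    def roi(i: int) -> float:
        return float("inf") if p[i] == 0 else q[i] / p[i]

    r = [0.0] * len(p)
    budget = delta
    for i in sorted(range(len(p)), key=roi, reverse=True):
        if p[i] == 0:
            r[i] = 1.0                      # free asset: take all of it
        elif budget > 0:
            r[i] = min(1.0, budget / p[i])  # buy as much as the budget allows
            budget -= r[i] * p[i]
    return r


def uniform_strategy(p: Sequence[float], delta: float) -> List[float]:
    """Unknown, unconstrained q: buy the same fraction delta of every non-free asset."""
    return [1.0 if pi == 0 else delta for pi in p]


# Hypothetical costs, selling prices, and budget.
p, q, delta = [0.2, 0.8], [0.1, 0.9], 0.1
print(greedy_two_class(p, q, delta))
print(uniform_strategy(p, delta))
```

With the uniform purchase the amount spent is $\delta \sum_i p_i = \delta$ and the amount earned is $\delta \sum_i q_i = \delta$ whenever no asset is free, which is the total ROI of 1 mentioned in the footnote.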

5. On the Optimality of Monotone and Low-Density Rejection Functions 

The vast majority of the literature on SCC deals with various techniques for implementing 
the Low-Density Rejection Strategy (LDRS). This raises the question of whether such a 
strategy is optimal or not, and under what conditions may it be reasonable to use such a 
strategy. Since we are interested in adversarial applications, worst-case performance is a 
natural measure for us to consider. For example, if one considers an authentication system,
every attempt to gain access results in either access being granted or an alarm being fired.
From a worst-case perspective, we should expect a sophisticated intruder to be capable of 
spying upon legitimate use of the system for some period of time and seeing what events 
or patterns should provide access. Thus, it is more likely that the intruder will attempt to 
enter a highly probable event in order to gain access, rather than a low-probability event. 

4. Note that this is different from 'dollar-cost averaging'; the same amount of money isn't spent on each
asset, rather the same absolute amount of each asset is purchased. This guarantees the learner a total 
ROI of at least 1 (i.e., for every dollar invested, a dollar is earned upon selling). 
In fact, the intruder's distribution could be even more concentrated on the highly-probable
events than the user's! 

Viewed in this perspective, it is not at all clear at the outset that the standard LDRS
approach to SCC is the best for adversarial applications. When the adversary's
strategy space is constrained to one where all of the distributions are tightly concentrated
on the highly probable events under $P$, low-density rejection may not be an optimal strategy for the
learner. In the extreme case where the adversary always plays the most probable event
under $P$, the adversary would always be able to gain access if the learner plays the
low-density rejection strategy, while potentially the learner could completely deny the adversary
access if $\delta$ is greater than the probability for that event. Clearly, the nature of the constraints
placed on the adversary is critical not only in terms of whether LDRS is optimal, but also
in terms of the error that is achievable (both by LDRS and by other strategies). Here we
address the former issue, which we feel is of particular relevance considering the large body
of existing work which examines approximating low-density rejection functions^5 that can
be leveraged in solving practical problems, and leave the latter for future research.

The partially good news is that low-density rejection is worst-case optimal if the learner 
is confined to "hard" decisions and when the adversary is strong enough in the sense that 
her strategy space is sufficiently large as shown in Theorem 10. However, as we demonstrate 
in Section 6, LDRS is inferior in general to the optimal soft strategy. Thus, by playing a 
randomized strategy, a very significant gain can be achieved. 

In this section, we assume a finite support of size $N$; that is, $\Omega = \{1, \ldots, N\}$ and
$P = \{p_1, \ldots, p_N\}$ and $Q = \{q_1, \ldots, q_N\}$ are probability mass functions. Note that this
assumption still leaves us with an infinite game because the learner's pure strategy space,
$\mathcal{R}_\delta(P)$, is infinite. Extensions to infinite support ($N \to \infty$) for many of the finite support
results are given in Nisenson (2010). A simple observation is that for any $r \in \mathcal{R}^*_\delta$ there
exists $r' \in \mathcal{R}^*_\delta$ such that $r'(i) = r(i)$ for all $i$ such that $p_i > 0$, and for zero probabilities,
$p_j = 0$, $r'(j) = 1$. We thus assume w.l.o.g. that $p_i > 0$ for all $i \in \Omega$.

While the low-density rejection strategy implies an assumption that lower probability 
events should be completely rejected, we instead examine a weaker, but perhaps more useful, 
condition. Intuitively, it seems plausible that the learner should not assign higher rejection 
values to higher probability events under P. That is, one may expect that a reasonable 
rejection function r(-) would be monotonically decreasing with probability values. In the 
ROI analogy, we would state this as "the learner should prefer cheaper assets to more 
expensive ones." This is appealing, as more of a cheaper asset can be purchased for the 
same amount of money than a more expensive asset, and a lower selling price is necessary 
to achieve the same ROI. We now define two types of monotonicity. 

5. See, e.g., (Schölkopf et al., 2001; Cuevas & Fraiman, 1997; Cadre, 2006; Breunig et al., 2000).

Definition 2 (Monotonicity). A rejection function $r(\cdot)$ is monotone if $p_j < p_k \Rightarrow r(j) \ge r(k)$.
A monotone rejection function $r(\cdot)$ is strictly monotone if $p_j = p_k \Rightarrow r(j) = r(k)$.

We note that completely rejecting null-events under $P$ (i.e., $p_j = 0 \Rightarrow r(j) = 1$) does not
break strict-monotonicity so our assumption that there are no null events under $P$ is taken
w.l.o.g. Surprisingly, optimal monotone strategies are not always guaranteed as shown in
the following example. 

Example 1 (Non-Monotone Optimality). In the hard setting, take $N = 3$, $P =
(0.06, 0.09, 0.85)$ and $\delta = 0.1$. The two $\delta$-valid hard rejection functions are $r' = (1, 0, 0)$ and
$r'' = (0, 1, 0)$. Let $\mathcal{Q} = \{Q = (0.01, 0.02, 0.97)\}$. Clearly $\rho(r', Q) = 0.01$ and $\rho(r'', Q) = 0.02$
and therefore, $r''(\cdot)$ is optimal despite breaking monotonicity. More generally, this example
holds if $\mathcal{Q} = \{Q : q_2 - q_1 > \epsilon\}$ for any $0 < \epsilon < 1$.

In the soft setting, let $N = 2$, $P = (0.2, 0.8)$, and $\delta = 0.1$. We note that $\mathcal{R}_\delta(P) =
\{r_\epsilon = (0.1 + 4\epsilon, 0.1 - \epsilon)\}$, for $\epsilon \in [-0.025, 0.1]$. We take $\mathcal{Q} = \{Q = (0.1, 0.9)\}$. Then
$\rho(r_\epsilon, Q) = 0.1 + 0.4\epsilon - 0.9\epsilon = 0.1 - 0.5\epsilon$. This is clearly maximized when we minimize $\epsilon$
by taking $\epsilon = -0.025$, and then the optimal rejection function is $(0, 0.125)$, which clearly
breaks monotonicity. This example also holds for $\mathcal{Q} = \{Q : q_2 > c\,q_1\}$ for any $c > 4$.
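
The arithmetic of the soft-setting example can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `rejection_rate` and `r_eps` and the grid over $\epsilon$ are assumptions introduced here, not part of the original text.

```python
def rejection_rate(r, q):
    """rho(r, Q): probability that an observation drawn from Q is rejected."""
    return sum(ri * qi for ri, qi in zip(r, q))


def r_eps(eps):
    """The soft family from the example: r_eps = (0.1 + 4*eps, 0.1 - eps)."""
    return (0.1 + 4 * eps, 0.1 - eps)


P = (0.2, 0.8)
Q = (0.1, 0.9)
delta = 0.1

# Hypothetical grid covering the admissible range [-0.025, 0.1].
eps_grid = [-0.025 + 0.125 * t / 50 for t in range(51)]

# Every member of the family rejects exactly delta of the target mass.
assert all(abs(rejection_rate(r_eps(e), P) - delta) < 1e-12 for e in eps_grid)

# The learner prefers the member with the largest rejection rate against Q.
best_eps = max(eps_grid, key=lambda e: rejection_rate(r_eps(e), Q))
best_r = r_eps(best_eps)
print(best_eps, best_r, rejection_rate(best_r, Q))
```

The maximizer sits at the left end of the range, where the less probable event receives a smaller rejection value than the more probable one, which is the monotonicity violation the example describes.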

This example naturally raises the question of which conditions are necessary or sufficient
for optimal monotone strategies to be guaranteed. To motivate our sufficient condition for
optimality (Property A below), recall the intrusion detection setting discussed in the
beginning of this section. There the adversary is constrained to distributions that are tightly
concentrated on the highly probable events under $P$. In this case, since low probability
events are scarcely "attacked" by the adversary, the optimal learner would not waste
rejection "resources" on low probability events. In other words, in such cases monotone
rejection functions aren't optimal. This raises the question of whether monotone rejection
functions are optimal when the adversary is not constrained from attacking low probability events.

Definition 3 (Property A). Let $P$ be a distribution and $\mathcal{Q}$ be a set of distributions. If
for all $p_j < p_k$ and $Q \in \mathcal{Q}$ for which $q_j < q_k$, there exists a distribution $Q' \in \mathcal{Q}$ such that
for all $i \ne j, k$, $q'_i = q_i$ and $q_j + q'_j \ge q_k + q'_k$, then $\mathcal{Q}$ possesses Property A w.r.t. $P$.

Example 2 (Possession of Property A). Let $P$ be any distribution over $\Omega$. Let $\mathcal{Q}_1 =
\{U\}$, where $U$ is the uniform distribution over $\Omega$. Then $\mathcal{Q}_1$ has Property A w.r.t. $P$ since
$q_j < q_k$ is never true. Similarly, let $\mathcal{Q}_2$ be the set of all distributions (if $Q$ is a distribution
over $\Omega$, then $Q \in \mathcal{Q}_2$). Then $\mathcal{Q}_2$ also has Property A w.r.t. $P$. If $P \ne U$ and $\mathcal{Q}_3 = \{P\}$,
then $\mathcal{Q}_3$ doesn't possess Property A w.r.t. $P$.

The following theorem ensures that there exists an optimal monotone rejection function
whenever $\mathcal{Q}$ satisfies Property A. In such cases the learner's search space can be
conveniently confined to monotone strategies.

Theorem 4 (Optimal Monotone Hard Strategies). When the learner is restricted to
hard-decisions and $\mathcal{Q}$ satisfies Property A w.r.t. $P$, then there exists a monotone $r \in \mathcal{R}^*_\delta$.

Theorem 4 only concerns the hard setting where r is a zero-one rule. The following 
Property B and the accompanying Theorem 6 treat the more general soft setting. 

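Definition 5 (Property B). Let $P$ be a distribution and $\mathcal{Q}$ be a set of distributions. If
for all $0 < p_j \le p_k$ and $Q \in \mathcal{Q}$ for which $\frac{q_j}{p_j} < \frac{q_k}{p_k}$ there exists $Q' \in \mathcal{Q}$ such that for all
$i \ne j, k$, $q'_i = q_i$ and $\frac{q'_j}{p_j} \ge \frac{q'_k}{p_k}$, then $\mathcal{Q}$ possesses Property B w.r.t. $P$.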

Example 3 (Possession of Property B). Let $P$ be any distribution over $\Omega$. Let $\mathcal{Q}_1 =
\{U\}$, $\mathcal{Q}_2$ be the set of all distributions and $\mathcal{Q}_3 = \{P\}$. All three sets, $\mathcal{Q}_1$, $\mathcal{Q}_2$ and $\mathcal{Q}_3$, have
Property B w.r.t. $P$.

Recalling our informal investment analogy, if the strategy space of the adversary satisfies 
Property B, then cheaper assets always have the potential for higher ROI (and equally priced 
assets have equal ROI opportunities). If this is the case, then Theorem 6 states that there is 
an optimal investment strategy (that maximizes the overall return), which never purchases 
more of an expensive asset than a cheaper one and always invests identically in equally 
priced assets. 

Theorem 6 (Optimal Monotone Soft Strategies). 

If Q satisfies Property B w.r.t. P, then there exists an optimal strictly monotone rejection 
function. 

Remark 7. It is not hard to prove that a slightly stronger version of Property A implies
Property B. The stronger version of Property A is that the property also holds when $p_j = p_k$
(rather than only for $p_j < p_k$).

In the remainder of this section we only consider the hard setting. Theorem 4 tells us 
that there exists an optimal rejection function in the set of monotone rejection functions 
provided that Property A holds. Obviously, to be optimal the rejection function should 
reject as much as possible up to the $\delta$ bound. We now show that if $\mathcal{Q}$ is sufficiently rich
(satisfying Property C below) then any "low-density rejection function" is optimal. 

Definition 8 (Low-Density Rejection Function (LDRF) and Strategy (LDRS)).

A hard, $\delta$-valid, monotone rejection function $r(\cdot)$ is called a low-density rejection function
if its $\rho(r, P)$ is maximal among all hard, monotone $\delta$-valid rejection functions. The strategy
of selecting any LDRF is called the low-density rejection strategy (LDRS).
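
One way to construct an LDRF in the finite setting is a greedy pass over the events in order of increasing probability. The sketch below is an illustration under stated assumptions rather than the authors' procedure: the function name `low_density_rejection`, the numerical tolerance, and the tie-breaking by index order are all choices made here. Because a hard monotone function must reject every strictly less probable event before a more probable one, the longest prefix of the sorted order whose mass stays within $\delta$ attains the maximal $\rho(r, P)$.

```python
def low_density_rejection(p, delta, tol=1e-12):
    """Greedy sketch of a hard LDRF: reject the lowest-probability events
    while their total mass under P stays within delta."""
    # Stable sort: equally probable events keep their index order (arbitrary tie-break).
    order = sorted(range(len(p)), key=lambda i: p[i])
    r = [0] * len(p)
    mass = 0.0
    for i in order:
        if mass + p[i] > delta + tol:
            break  # every remaining event is at least as probable, so none fits
        r[i] = 1
        mass += p[i]
    return r, mass


r, mass = low_density_rejection([0.02, 0.03, 0.05, 0.05, 0.85], 0.1)
print(r, mass)
```

When several events share the probability at which the budget runs out, any choice among them gives an LDRF, so the output is one representative of the strategy rather than a unique answer.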

Definition 9 (Property C). Let $P$ be a distribution. We say that the set $\mathcal{Q}$ satisfies
Property C (w.r.t. $P$) if for each $p_j = p_k$ and $Q \in \mathcal{Q}$, there exists $Q' \in \mathcal{Q}$ such that $q'_j = q_k$
and $q'_k = q_j$, and for all other events, $Q'$ coincides with $Q$.

Some intuition about Property C can be gained by considering some adversary strategy
space Q. First note that by expanding Q to satisfy Property C the adversary can only be 
strengthened. The property ensures that the adversary can take advantage of situations 
where the learner doesn't identically treat equally probable events under P . When the 
adversary is sufficiently strong in this sense we are able to show that LDRS dominates any 
monotone rejection function. Therefore, if Q also satisfies Property A, in which case there 
exists an optimal monotone rejection function (Theorem 4), then LDRS is optimal. This is 
summarized in the following theorem. 

Theorem 10 (LDRS Optimality). Let $r^*$ be an LDRF. Let $r$ be any monotone $\delta$-valid
rejection function. Then, $r^*$ dominates $r$,

$$\min_{Q \in \mathcal{Q}} \rho(r^*, Q) \ge \min_{Q \in \mathcal{Q}} \rho(r, Q), \qquad (1)$$

for any $\mathcal{Q}$ satisfying Property C. Thus, if $\mathcal{Q}$ possesses both Property A and Property C
w.r.t. $P$, then LDRS is hard-optimal.

Example 4 (Violating Property C Breaks Domination). We illustrate here that a violation
of Property C may result in a violation of the domination inequality (1) in Theorem 10.
Let $N = 5$, $P = (0.02, 0.03, 0.05, 0.05, 0.85)$, and $\delta = 0.1$. Then the two $\delta$-valid LDRS rejection
functions are $r = (1, 1, 1, 0, 0)$ and $r' = (1, 1, 0, 1, 0)$. Let $\mathcal{Q} = \{Q : q_3 - q_4 > \epsilon\}$ for some
$0 < \epsilon < 1$. Clearly, $\mathcal{Q}$ does not satisfy Property C. For any $Q \in \mathcal{Q}$, $\rho(r, Q) - \rho(r', Q) =
q_3 - q_4 > \epsilon$, and therefore, $\min_{Q \in \mathcal{Q}} \rho(r', Q) < \min_{Q \in \mathcal{Q}} \rho(r, Q)$. Thus, the monotone function
$r$ dominates the LDRF $r'$. Hence, LDRS isn't optimal because $r'$ could be chosen.

6. The Omniscient Adversary: Games, Strategies and Bounds 

We next turn our attention to the power of the adversary, an issue that hasn't been em- 
phasized in the SCC literature, but has crucial impact on the relevancy of SCC solutions 
in adversarial applications. For example, when considering intrusion detection (see, e.g., 
Lazarevic, Ertoz, Kumar, Ozgur, & Srivastava, 2003), it is necessary to assume that the 
"attacking distribution" has some worst-case characteristics and it is important to quantify 
precisely what the adversary knows or can do. The simple observation in this setting is that 
an omniscient and unconstrained adversary, who knows all parameters of the game includ- 
ing the learner's strategy, would completely demolish the learner who uses hard strategies. 
By using a soft strategy, the learner can achieve the slightly better result of $1 - \delta$ type II
error (false negative rate). In either case, the presence of such a powerful adversary makes 
the SCC problem trivial and the resulting rejection function is practically worthless. These 
simple results are developed in Section 6.1. 

We therefore consider an omniscient but limited adversary. In seeking a useful and
quantifiable constraint on $\mathcal{Q}$ it is helpful to recall that the essence of the SCC problem is to
try to distinguish between two probability distributions (albeit one of them unknown). A
natural constraint is a lower bound on the "distance" between these distributions. Indeed, it
is immediately obvious that if $P \in \mathcal{Q}$, the adversary can always achieve the maximal type II
error of $1 - \delta$ by selecting $Q = P$. Following similar results in hypothesis testing (see Cover
& Thomas, 1991, Chapt. 12), we could consider games in which the adversary must select
$Q$ such that $D(P\|Q) \ge \Lambda$, for some constant $\Lambda > 0$, where $D(\cdot\|\cdot)$ is the KL-divergence;
that is, $D(P\|Q) = \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i}$ (Cover & Thomas, 1991). Unfortunately, this constraint
is vacuous since $D(P\|Q)$ "explodes" when $q_i \ll p_i$ (for any $i$). In this case the adversary
can optimally play the same strategy as in the unrestricted game while meeting the
KL-divergence constraint. Fortunately, by taking $D(Q\|P) \ge \Lambda$, we can effectively constrain the
adversary (see footnote 6). Instead of only considering the KL-divergence we consider adversary
constraints using a large family of divergences that include the KL-divergence, the $L_2$ norm and
various Bregman divergences. Definitions 11 and 13 characterize this family.
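
The asymmetry between the two directions of the KL-divergence can be seen on a small example. The sketch below uses an assumed three-event distribution and a helper `kl` written for this illustration (natural logarithm); neither comes from the original text. A point mass on the most probable event makes $D(P\|Q)$ infinite, so it meets any lower bound on $D(P\|Q)$, while $D(Q\|P)$ for the same point mass is only $\log(1/p_j)$.

```python
import math


def kl(a, b):
    """KL-divergence D(a || b) in nats; infinite when b lacks mass where a has it."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai == 0:
            continue  # terms with zero mass in a contribute nothing
        if bi == 0:
            return math.inf
        total += ai * math.log(ai / bi)
    return total


P = (0.1, 0.3, 0.6)            # hypothetical target distribution
point_mass = (0.0, 0.0, 1.0)   # adversary concentrated on the most probable event

print(kl(P, point_mass))   # inf: any constraint D(P||Q) >= Lambda is met
print(kl(point_mass, P))   # log(1/0.6): bounded, so D(Q||P) >= Lambda can exclude it
```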

One of our main contributions is a complete analysis of this constrained game in Sec- 
tion 6.2, including identification of the optimal strategy for the learner and the adversary, 
as well as the best achievable false negative rate. The optimal learner strategy and best 
achievable rate are obtained via a solution of a linear program specified in terms of the 
problem parameters. These results are immediately applicable as lower bounds for stan- 
dard (finite-sample) SCC problems, but may also be used to inspire new types of algorithms 
for standard SCC. While we do not have a closed form expression for the best achievable 
false-negative rate, we provide a few numerical examples demonstrating and comparing the 
optimal "hard" and "soft" performance. 

6.1 Unrestricted Adversary 

In the first game we analyze an adversary who is completely unrestricted. This means that 
Q is the set of all distributions. Unsurprisingly, this game leaves little opportunity for the 
learner. For any rejection function $r(\cdot)$, define $r_{\min} = \min_i r(i)$ and $I_{\min}(r) = \{i : r(i) =
r_{\min}\}$. For any distribution $D$, $\rho(r, D) = \sum_{i=1}^{N} d_i\, r(i) \ge \sum_{i=1}^{N} d_i\, r_{\min} = r_{\min}$; in particular,
$\delta = \rho(r, P) \ge r_{\min}$ and $\min_Q \rho(r, Q) \ge r_{\min}$. By choosing $Q$ such that $q_i = 1$ for some
$i \in I_{\min}(r)$, the adversary can achieve $\rho(r, Q) = r_{\min}$ (the same rejection rate is achieved by
taking any $Q$ with $q_i = 0$ for all $i \notin I_{\min}(r)$). In the soft setting, $\min_Q \rho(r, Q)$ is maximized
by the rejection function $r^\delta(i) = \delta$ for all $p_i > 0$ ($r^\delta(i) = 1$ for all $p_i = 0$). This is equivalent to
flipping a $\delta$-biased coin for non-null events (under $P$). The best achievable type II error is
$1 - \delta$. In the hard setting, clearly $r_{\min} = 0$ (otherwise $1 > \delta \ge 1$), and the best achievable
type II error is precisely 1. That is, absolutely nothing can be achieved.

6. Under the investment analogy, requiring that $D(P\|Q)$ be large is equivalent to requiring a small
"average" value for $\frac{q_i}{p_i}$ (giving the learner poor investment opportunities). On the other hand, requiring
that $D(Q\|P)$ be large is equivalent to requiring that the "average" value of $\frac{q_i}{p_i}$ be sufficiently large
(providing the learner with good investment opportunities, and potentially increasing the value of $\rho(r, Q)$).
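
The unrestricted game is simple enough to spell out in code. The following sketch (helper names and the example distribution are assumptions made here) computes the adversary's best response as a point mass on an event with the smallest rejection value, and compares the constant soft rule with a hard rule.

```python
def rejection_rate(r, q):
    """rho(r, Q) = sum_i q_i * r(i)."""
    return sum(ri * qi for ri, qi in zip(r, q))


def adversary_best_response(r):
    """Point mass on an event where the rejection value is smallest."""
    j = min(range(len(r)), key=lambda i: r[i])
    q = [0.0] * len(r)
    q[j] = 1.0
    return q


delta = 0.1
P = (0.06, 0.09, 0.85)           # hypothetical target distribution
soft = [delta] * len(P)          # the delta-biased coin on every event
hard = [1, 0, 0]                 # a delta-valid hard rule (rejects mass 0.06)

for name, r in (("soft", soft), ("hard", hard)):
    worst = rejection_rate(r, adversary_best_response(r))
    print(name, "worst-case type II error:", 1 - worst)
```

The soft rule guarantees a rejection rate of $\delta$ against every adversary, whereas any hard $\delta$-valid rule with $\delta < 1$ leaves some event unrejected and is driven to a rejection rate of zero.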

This simple analysis shows the futility of the SCC game when the adversary is too 
powerful. In order to consider SCC problems at all one must consider reasonable restrictions 
on the adversary that lead to more useful games. One type of such a restriction would be to 
limit the adversary's knowledge of $r(\cdot)$, $P$ and/or of $\delta$. Another type would be to directly
limit the strategic choices available to the adversary. We note that the former type of 
restriction doesn't affect the best achievable type II error, and thus in the next section we 
will focus on the latter. 

6.2 An Omniscient, but Constrained, Adversary 

While we could therefore define $\mathcal{Q} = \mathcal{Q}_\Lambda = \{Q : D(Q\|P) \ge \Lambda\}$, we instead will consider a
more general family. First, let $\mathcal{X}$ be the $N$-dimensional simplex: $\mathcal{X} = \{(x_1, \ldots, x_N) : x_i \ge
0, \sum_{i=1}^{N} x_i = 1\}$. For convenience, we now define a transfer function, $t(X, a, b) \to \mathcal{X}$, where
$X \in \mathcal{X}$, and $a$ and $b$ are indices in $\{1, \ldots, N\}$, which transfers probability from event $b$ to
event $a$, as:

$$t(X, a, b)_i = \begin{cases} x_a + x_b & i = a, \\ 0 & i = b, \\ x_i & \text{otherwise.} \end{cases}$$
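
A direct transcription of the transfer function may help fix ideas. This is a minimal sketch: the name `transfer`, the use of 0-based indices, and the decision to reject `a == b` (a case the definition does not address) are assumptions made here.

```python
def transfer(x, a, b):
    """t(X, a, b): move all probability mass of event b onto event a.

    Indices are 0-based here, unlike the 1-based indices in the text.
    """
    if a == b:
        raise ValueError("a and b must be distinct events")
    y = list(x)
    y[a] = x[a] + x[b]
    y[b] = 0.0
    return y


y = transfer((0.2, 0.3, 0.5), 0, 2)
print(y)  # mass of the third event has moved to the first; total mass is unchanged
```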

Definition 11 (2-Symmetric). A function $f_P : \mathcal{X} \to \mathbb{R}$ is called 2-symmetric if for all
$X \in \mathcal{X}$ and for all $j, k$ such that $p_j = p_k$, $f_P(t(X, j, k)) = f_P(t(X, k, j))$.

Remark 12. We note that a Bregman divergence (defined over $[0, 1]^N$) may be 2-symmetric.
Specifically, define $D_P(Q) = B_F(Q\|P) = F(Q) - F(P) - \nabla F(P) \cdot (Q - P)$. Let $\Delta_F = F(t(Q, j, k)) -
F(t(Q, k, j))$. Then, the divergence is 2-symmetric if, for all $p_j = p_k$:

$$0 = D_P(t(Q, j, k)) - D_P(t(Q, k, j)) = \Delta_F - \nabla F(P) \cdot \big(t(Q, j, k) - t(Q, k, j)\big) = \Delta_F - (q_j + q_k)\left(\frac{\partial F(P)}{\partial x_j} - \frac{\partial F(P)}{\partial x_k}\right).$$

We note that if $F(X) = \sum_{i=1}^{N} f(x_i)$, where $f(\cdot)$ is a strictly convex function, then clearly
the Bregman divergence is 2-symmetric.

Definition 13 (Receding). A function $f_P : \mathcal{X} \to \mathbb{R}$ is called receding if for all $X \in \mathcal{X}$,
$p_j < p_k$ and $x_k > 0$, $f_P(t(X, j, k)) > f_P(X)$. A receding function $D_P : \mathcal{X} \to \mathbb{R}$ is called a
receding divergence if it is defined over the domain $[0, 1]^N$, it is differentiable over $(0, 1)^N$
and is strictly convex.

Remark 14. We note that a Bregman divergence may be a receding divergence, as well.
Specifically, define $D_P(Q) = B_F(Q\|P) = F(Q) - F(P) - \nabla F(P) \cdot (Q - P)$. This trivially
meets the differentiability and strict convexity requirements. Let us examine if it is receding.
Let $p_j < p_k$, $q_k > 0$ and let $\Delta = t(Q, j, k) - Q$. Then, in order to satisfy the property:

$$0 < D_P(t(Q, j, k)) - D_P(Q) = F(Q + \Delta) - F(Q) - \nabla F(P) \cdot \Delta = F(Q + \Delta) - F(Q) + q_k\left(\frac{\partial F(P)}{\partial x_k} - \frac{\partial F(P)}{\partial x_j}\right).$$

We note that if $F(X) = \sum_{i=1}^{N} f(x_i)$, where $f(\cdot)$ is a strictly convex function, then $F(t(X, j, k)) =
F(t(X, k, j))$ for all $j, k$, and thus, by convexity:

$$F(Q + \Delta) - F(Q) = F(t(Q, j, k)) - F(Q) \ge 0, \qquad \frac{\partial F(P)}{\partial x_k} - \frac{\partial F(P)}{\partial x_j} = f'(p_k) - f'(p_j) > 0.$$

Thus, Bregman divergences which are of this form, such as the squared Euclidean distance
$D_P(Q) = \|Q - P\|^2$ and the KL-divergence, are also (2-symmetric) receding divergences.
Note that this condition is sufficient and not necessary. It is certainly possible for Bregman
divergences which are not of this form to be receding divergences as well.
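
The receding property can be spot-checked numerically for the two divergences named in Remark 14. The sketch below tests the defining inequality at a single, arbitrarily chosen pair $(P, Q)$; it is an illustration under those assumptions, not a proof, and all function names are introduced here.

```python
import math


def transfer(x, a, b):
    """t(X, a, b) with 0-based indices: move the mass of event b onto event a."""
    y = list(x)
    y[a] = x[a] + x[b]
    y[b] = 0.0
    return y


def kl(q, p):
    """D(Q || P) in nats, for P with full support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def squared_euclidean(q, p):
    return sum((qi - pi) ** 2 for qi, pi in zip(q, p))


def is_receding_at(div, q, p):
    """Check D_P(t(Q, j, k)) > D_P(Q) for every pair with p_j < p_k and q_k > 0."""
    n = len(p)
    return all(
        div(transfer(q, j, k), p) > div(q, p)
        for j in range(n)
        for k in range(n)
        if p[j] < p[k] and q[k] > 0
    )


P = (0.1, 0.3, 0.6)   # hypothetical target distribution
Q = (0.2, 0.3, 0.5)   # hypothetical adversary distribution
print(is_receding_at(kl, Q, P), is_receding_at(squared_euclidean, Q, P))
```

In both cases, moving mass from a more probable event (under $P$) to a less probable one increases the divergence from $P$, which is what keeps the constrained adversary away from the high-probability events.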

We define $\mathcal{Q}_\Lambda = \{Q : D_P(Q) \ge \Lambda\}$, where $D_P(\cdot)$ is a 2-symmetric receding divergence.
We say that a distribution $Q$ meets the divergence constraint if $D_P(Q) \ge \Lambda$. As we will
shortly see, this is consistent with an adversary that can't eavesdrop on the user, as the 
constraint prevents the adversary from selecting distributions which are only concentrated 
on high-probability events under P. 

Lemma 15. $\mathcal{Q}_\Lambda$ possesses Properties A and B w.r.t. $P$.

Proof Let $j, k$ be such that $p_j \le p_k$. For any distribution $Q \in \mathcal{Q}_\Lambda$ we define $Q' = t(Q, j, k)$.
If $p_j < p_k$, then since $D_P(\cdot)$ is receding, $D_P(Q') > D_P(Q) \ge \Lambda$. Otherwise, if $p_j = p_k$, since
$D_P(\cdot)$ is 2-symmetric and convex, $D_P(Q') \ge D_P(Q) \ge \Lambda$. Thus, in either case, $Q' \in \mathcal{Q}_\Lambda$. If
$Q$ is such that $q_j < q_k$, then $q'_j + q_j = 2q_j + q_k \ge q_k = q_k + q'_k$, and $\mathcal{Q}_\Lambda$ has Property A. If
$Q$ is such that $\frac{q_j}{p_j} < \frac{q_k}{p_k}$, then $\frac{q'_j}{p_j} = \frac{q_j + q_k}{p_j} > 0 = \frac{q'_k}{p_k}$, and $\mathcal{Q}_\Lambda$ possesses Property B. ■

Therefore, by Theorems 4 and 6 there exists a (strictly) monotone $r \in \mathcal{R}^*_\delta$ in the hard
(respectively, soft) setting. If $\mathcal{Q}_\Lambda$ has Property C as well, then by Theorem 10 any $\delta$-valid
LDRF is hard-optimal. It is easy to verify that Bregman divergences of the form described
in Remark 14 possess Property C.

We now define $X^{(j)}$ as the distribution which is completely concentrated on event $j$.
In other words $x^{(j)}_i = \mathbb{I}(i = j)$, where $\mathbb{I}(\cdot)$ is the indicator function. We assume that $0 <
p_1 \le p_2 \le \cdots \le p_N$. Therefore, since $D_P(\cdot)$ is receding, $D_P(X^{(1)}) \ge D_P(X^{(2)}) \ge \cdots \ge
D_P(X^{(N)})$. Therefore if $D_P(X^{(N)}) \ge \Lambda$, then any $Q$ that is concentrated on a single
event meets the constraint $D_P(Q) \ge \Lambda$. Then, the adversary can play the same strategy
as in the unrestricted game, and the learner should select $r^\delta$ as before. For the game to
be non-trivial it is thus required that $\Lambda > D_P(X^{(N)})$. Similarly, if the optimal $r$ is such
that there exists $j \in I_{\min}(r)$ (that is, $r(j) = r_{\min}$) and $D_P(X^{(j)}) \ge \Lambda$, then a distribution
$Q$ that is completely concentrated on $j$ has $D_P(Q) \ge \Lambda$ and achieves $\rho(r, Q) = r_{\min}$, as in
the unrestricted game. Therefore, $r = r^\delta$, and so maximizes $r_{\min}$. This yields the following
definition:

Definition 16. A rejection function $r$ is called vulnerable if there exists $j \in I_{\min}(r)$ such
that $D_P(X^{(j)}) \ge \Lambda$.

We begin our analysis of the game by identifying some useful characteristics of optimal 
adversary strategies under the assumption that the chosen rejection function isn't vulner- 
able. These properties, that are stated in Lemma 18, are then used to prove Theorem 19 
showing that the effective support of an optimal Q has a size of two at most. Based on these 
properties, we provide in Theorem 23 a linear program that computes an optimal rejection 
function (under the assumption that it isn't vulnerable). Finally, in Lemma 24 we show that 
the solution computed by the linear program is $r^\delta$ if it is vulnerable, giving optimal (though
trivial) performance. Thus, in any case, the output of the linear program is optimal. 

If $\Lambda > D_P(X^{(1)})$, then no adversary distribution can meet the divergence constraint.
We therefore limit ourselves to cases where $\Lambda \le D_P(X^{(1)})$. We can now divide the events
in $\Omega$ into two groups, $H$ and $L$, such that $H = \{i : D_P(X^{(i)}) < \Lambda\}$ and $L = \Omega \setminus H$. We
note that the assumption that $r$ isn't vulnerable implies that $I_{\min}(r) \subseteq H$. By definition,
$\forall h \in H, l \in L$, we have that $p_h > p_l$.

Lemma 17. If $Q$ meets the divergence constraint, there exists an event $i \in L$ for which
$q_i > 0$.

Proof Let us assume that $q_i = 0$ for all $i \in L$. Let $j$ be the smallest event in $H$. Since
$D_P(\cdot)$ is receding, $D_P(Q) \le D_P(X^{(j)}) < \Lambda$. Contradiction. ■

Lemma 18. Let $r$ be a rejection function which isn't vulnerable. If $Q$ meets the divergence
constraint and minimizes $\rho(r, Q)$ (among all distributions meeting the constraint), then:

i. $D_P(Q) = \Lambda$;

ii. Let $u, v$ be two indices in $\{1, \ldots, N\}$. Define $Q'' = t(Q, u, v)$. If $q_v > 0$ and $D_P(Q'') \ge
\Lambda$, then $r(u) \ge r(v)$. Furthermore, $r(u) = r(v) \Rightarrow D_P(Q'') = \Lambda$;

iii. $p_j < p_k$ and $q_k > 0 \Rightarrow r(j) > r(k)$;

iv. $p_j < p_k$ and $q_j, q_k > 0 \Rightarrow \frac{\partial D_P(Q)}{\partial x_j} > \frac{\partial D_P(Q)}{\partial x_k}$;

v. $q_j, q_k > 0 \Rightarrow p_j \ne p_k$;

vi. $p_j < p_k$ and $q_j > 0 \Rightarrow D_P(Q) > D_P(t(Q, k, j))$.

Proof

i. Assume that $D_P(Q) > \Lambda$. By Lemma 17 there exists a non-empty set $L_Q = \{i \in
L \mid q_i > 0\}$. Let $h_{\max} = \arg\max_{i \in I_{\min}(r)} q_i$. Clearly, $h_{\max} \in H$. We define a new
distribution $Q^*$, which is identical to $Q$ except that probability is transferred from
events in $L_Q$ to $h_{\max}$, in order to make $D_P(Q^*) = \Lambda$ (this is possible, since $D_P(\cdot)$
is continuous and, by Lemma 17, transferring all probability from $L_Q$ to $h_{\max}$ would
result in $D_P(\cdot) < \Lambda$). Since transferring any probability from $i \in L_Q$ to $h_{\max}$ results
in making $\rho(r, Q)$ smaller, $\rho(r, Q^*) < \rho(r, Q)$, contradicting the fact that $Q$ minimizes
the rejection rate.

ii. We note that $\rho(r, Q'') = \rho(r, Q) - q_v\,(r(v) - r(u))$. Since $\rho(r, Q)$ is minimal and
$D_P(Q'') \ge \Lambda$ it follows that $r(u) \ge r(v)$. If $r(u) = r(v)$ then $\rho(r, Q'') = \rho(r, Q)$, and
by part (i), $D_P(Q'') = \Lambda$.

iii. By part (ii), taking $u = j$ and $v = k$ we trivially get $r(j) \ge r(k)$. Furthermore, since
$p_u = p_j < p_k = p_v \Rightarrow D_P(Q'') > \Lambda$, $r(j) \ne r(k)$. Thus, $r(j) > r(k)$.

iv. Assume, contradictorily, that $\frac{\partial D_P(Q)}{\partial x_j} \le \frac{\partial D_P(Q)}{\partial x_k}$. Let $0 < \epsilon < \min\{q_j, q_k\}$. We define
$\epsilon_{j \to k} = \epsilon\,(X^{(k)} - X^{(j)})$. Then, by convexity:

$$D_P(Q + \epsilon_{j \to k}) \ge D_P(Q) + \nabla D_P(Q) \cdot \epsilon_{j \to k} = D_P(Q) + \epsilon\left(\frac{\partial D_P(Q)}{\partial x_k} - \frac{\partial D_P(Q)}{\partial x_j}\right) \ge D_P(Q).$$

Therefore, by defining $Q' = Q + \epsilon_{j \to k}$, we have that $D_P(Q') \ge D_P(Q) \ge \Lambda$. Furthermore,
by part (iii), $r(j) > r(k)$. Therefore, $\rho(r, Q') = \rho(r, Q) + \epsilon\,(r(k) - r(j)) < \rho(r, Q)$.
Contradiction.

v. Assume that $p_j = p_k$. We consider two cases. In the first case, $r(j) < r(k)$, w.l.o.g. By
defining $u = j$, $v = k$, from part (ii) we get that $r(j) \ge r(k)$, which is a contradiction.
In the second case, $r(j) = r(k)$. However, since both $q_j$ and $q_k$ are greater than zero,
defining $u = j$ and $v = k$ in part (ii) gives us that $D_P(Q'') > \Lambda$, which is again a
contradiction.

vi. If $q_k = 0$ then $Q = t(t(Q, k, j), j, k)$ and $D_P(Q) > D_P(t(Q, k, j))$. Otherwise, $q_k > 0$
and by part (iii), $r(j) > r(k)$. If we assume contradictorily that $D_P(t(Q, k, j)) \ge
D_P(Q) = \Lambda$, then by part (ii), taking $u = k$ and $v = j$, $r(k) \ge r(j)$. Contradiction. ■



Theorem 19. If r isn't vulnerable, then any optimal adversarial strategy Q has an effective 
support of size at most two. 

Proof Let us assume, by contradiction, that the theorem's statement is wrong; that is,
there exists an optimal $Q^*$ that has $J > 2$ events for which $q^*_i \ne 0$. W.l.o.g. we rename
our events such that these are the first $J$ events. We note that $Q^*$ is a solution (i.e., global
minimum) to the following problem (*):

$$\text{minimize } \rho(r, Q) = \sum_{i=1}^{J} r(i)\, q_i, \quad \text{subject to: } \sum_{i=1}^{J} q_i = 1, \quad D_P(Q) = \Lambda, \quad 0 < q_i < 1,\ i \in \{1, \ldots, J\}.$$

We will now prove that $Q^*$ does not in fact solve the problem. We do so in two parts:

1. We show that $Q^*$ is the unique global maximum of the Lagrangian of (*).

2. We show that there exists a different distribution $\tilde{Q}$ with the same effective support,
which meets the equality constraints. We therefore conclude that $\rho(r, \tilde{Q}) < \rho(r, Q^*)$,
contradicting the optimality of $Q^*$.

We now prove the first part. The Jacobian matrix for the equality constraints at $Q^*$ is:

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ \frac{\partial D_P(Q^*)}{\partial x_1} & \frac{\partial D_P(Q^*)}{\partial x_2} & \cdots & \frac{\partial D_P(Q^*)}{\partial x_J} \end{pmatrix}.$$

Since all $q^*_i > 0$, by parts (v) and (iv) of Lemma 18, for all distinct $j, k \le J$: $p_j \ne p_k$ and
$\frac{\partial D_P(Q^*)}{\partial x_j} \ne \frac{\partial D_P(Q^*)}{\partial x_k}$. Therefore, the gradients of the constraints are linearly independent at
$Q^*$ and therefore, since $Q^*$ is (at least) a local minimum to the problem (*), there exists
a unique Lagrangian multiplier vector $\lambda = (\lambda_1, \lambda_2)$ such that $Q^* = (q^*_1, q^*_2, \ldots, q^*_J)$ is an
extremum point of the Lagrangian:

$$L(Q, \lambda) = \sum_{i=1}^{J} r(i)\, q_i + \lambda_1 \big(D_P(Q) - \Lambda\big) + \lambda_2 \left(\sum_{i=1}^{J} q_i - 1\right).$$

The partial derivatives are: $\frac{\partial L(Q^*, \lambda)}{\partial q_i} = r(i) + \lambda_1 \frac{\partial D_P(Q^*)}{\partial x_i} + \lambda_2 = 0$. Therefore, for all
$j, k \in \{1, \ldots, J\}$:

$$\lambda_1 = \frac{r(k) - r(j)}{\frac{\partial D_P(Q^*)}{\partial x_j} - \frac{\partial D_P(Q^*)}{\partial x_k}}.$$

If we assume (w.l.o.g.) that $p_k < p_j$, then, from parts (iii) and (iv) of Lemma 18, $r(k) > r(j)$
and $\frac{\partial D_P(Q^*)}{\partial x_k} > \frac{\partial D_P(Q^*)}{\partial x_j}$. Thus, $\lambda_1 < 0$. Therefore, due to the strict convexity of $D_P(\cdot)$
and the linearity of the other two terms, the Lagrangian $L(Q, \lambda)$ is strictly concave.
Therefore, since $Q^*$ is an extremum point of the (strictly concave) Lagrangian function, it
is the unique global maximum.

We now wish to show that there exists some other distribution $\tilde{Q}$ that meets the divergence
constraint and has the same support as $Q^*$. We define $Q^{>3}$ as $q^{>3}_i = \mathbb{I}(i > 3)\, q^*_i$ and
$c_{123} = q^*_1 + q^*_2 + q^*_3$. Then we define:

$$g(q_1, q_2) = Q^{>3} + q_1 X^{(1)} + q_2 X^{(2)} + (c_{123} - q_1 - q_2)\, X^{(3)},$$
$$f(q_1, q_2) = D_P(g(q_1, q_2)) - \Lambda,$$
$$\text{for } i \in \{1, 2\}: \quad \frac{\partial f(q_1, q_2)}{\partial q_i} = \nabla D_P(g(q_1, q_2)) \cdot \big(X^{(i)} - X^{(3)}\big) = \frac{\partial D_P(g(q_1, q_2))}{\partial x_i} - \frac{\partial D_P(g(q_1, q_2))}{\partial x_3}.$$

Clearly, $g(q^*_1, q^*_2) = Q^*$ and $f(q^*_1, q^*_2) = 0$. From part (iv) of Lemma 18, we have for
$i \in \{1, 2\}$:

$$\frac{\partial f(q^*_1, q^*_2)}{\partial q_i} = \frac{\partial D_P(Q^*)}{\partial x_i} - \frac{\partial D_P(Q^*)}{\partial x_3} \ne 0.$$

Therefore, $f$ is smooth in the open, convex domain $\{q_1, q_2 > 0\} \cap \{q_1 + q_2 < c_{123}\}$ and
has a root in this domain at $(q^*_1, q^*_2)$ at which none of its partial derivatives are 0. Then,
there exist an infinite number of points in the domain for which $f = 0$ (this is true for
any sub-domain for which $(q^*_1, q^*_2)$ is an interior point). Let $(\tilde{q}_1, \tilde{q}_2) \ne (q^*_1, q^*_2)$ be one of
these points. Then, the distribution $\tilde{Q} = (\tilde{q}_1, \tilde{q}_2, c_{123} - \tilde{q}_1 - \tilde{q}_2, q^*_4, q^*_5, \ldots, q^*_J) \ne Q^*$ satisfies
$D_P(\tilde{Q}) = \Lambda$ and has the exact same effective support as $Q^*$. Therefore, $\tilde{Q}$ meets the
equality criteria of the Lagrangian. Since $Q^*$ is the unique global maximum of $L(Q, \lambda)$:
$\rho(r, \tilde{Q}) = L(\tilde{Q}, \lambda) < L(Q^*, \lambda) = \rho(r, Q^*)$, contradicting the fact that $Q^*$ is optimal. ■

We now turn our attention to the learner's selection of $r(\cdot)$. As already established by
Lemma 15 and Theorem 6, it is sufficient for the learner to consider only strictly monotone
rejection functions. Since for these functions $p_j = p_k \Rightarrow r(j) = r(k)$, the learner can
partition $\Omega$ into $K$ event subsets (where $K$ is determined by $P$), which correspond, by
probability, to "level sets", $S_1, S_2, \ldots, S_K$ (all events in a level set $S_i$ have probability $p(S_i)$).
We re-index these subsets such that $0 < p(S_1) < p(S_2) < \cdots < p(S_K)$. Define $K$ variables
$r_1, r_2, \ldots, r_K$, representing the rejection rate assigned to each of the $K$ level sets
($\forall \omega \in S_i$, $r(\omega) = r_i$). Since $D_P(\cdot)$ is 2-symmetric, $D_P(X^{(\omega)})$ is constant for all $\omega$ in a level
set $S$. Therefore, we use the notation $D_P^S = D_P(X^{(\omega)})$ for any $\omega \in S$. We group our level sets
by probability: $\mathcal{L} = \{S : D_P^S > \Lambda\}$, $\mathcal{M} = \{S : D_P^S = \Lambda\}$, and $\mathcal{H} = \{S : D_P^S < \Lambda\}$. We define
$w = \arg\max_i \{S_i \in \mathcal{L} \cup \mathcal{M}\}$.

Lemma 20. If $Q$ minimizes $\rho(r, Q)$ and meets the constraint $D_P(Q) \ge \Lambda$, then $r_w \ge
\rho(r, Q)$.

Proof Let $j \in S_w$. Then $D_P(X^{(j)}) \ge \Lambda$, and since $Q$ minimizes $\rho(r, Q)$, $r_w = \rho(r, X^{(j)}) \ge
\rho(r, Q)$. ■



By Theorem 19, if $r$ isn't vulnerable, the adversary-optimal $Q$ will have an effective
support of at most size 2. If it has an effective support of size 1, then the event $\omega$ for which
$q_\omega = 1$ cannot be from a level set in $\mathcal{L}$ or $\mathcal{H}$ (otherwise, part (i) of Lemma 18 would be
violated). Therefore, it must belong to the single level set in $\mathcal{M}$. Thus, if $\mathcal{M} = \{S_m\}$ (for
some index $m$), there are feasible solutions $Q$ such that $q_\omega = 1$ (for $\omega \in S_m$), all of which
have $\rho(r, Q) = r_m$. The following lemma characterizes optimal distributions $Q$ which have
an effective support of size 2.

Lemma 21. If $r$ isn't vulnerable and $Q$ is optimal with an effective support of size 2 (that
is, there are $j, k$ such that $q_j, q_k > 0$ and $q_j + q_k = 1$), then, assuming w.l.o.g. that $p_j < p_k$,
$j \in S_l \in \mathcal{L}$ and $k \in S_h \in \mathcal{H}$ for some $l$ and $h$.

Proof Since $q_j, q_k > 0$, and $Q$ is optimal, we have that $p_j \ne p_k$, by part (v) of Lemma 18.
Therefore, $p_j < p_k$, and by part (vi) of Lemma 18,

$$D_P(X^{(k)}) = D_P(t(Q, k, j)) < D_P(Q) < D_P(t(Q, j, k)) = D_P(X^{(j)}).$$

Assume, by contradiction, that $k$ belongs to a level set in $\mathcal{L}$ or $\mathcal{M}$. This is equivalent
to $D_P(X^{(k)}) \ge \Lambda$. We therefore have that $D_P(Q) > D_P(X^{(k)}) \ge \Lambda$, which is a violation
of part (i) of Lemma 18. Therefore, $k$ belongs to a level set in $\mathcal{H}$. Likewise, were we to
assume that $j$ belongs to a level set in $\mathcal{M}$ or $\mathcal{H}$ ($D_P(X^{(j)}) \le \Lambda$), it would follow that
$D_P(Q) < D_P(X^{(j)}) \le \Lambda$, which would also violate part (i) of Lemma 18. Therefore, $j$
belongs to a level set in $\mathcal{L}$. ■

Lemma 22. Let $S_l \in \mathcal{L}$ and $S_h \in \mathcal{H}$. Then, there always exists a single solution $q_{l,h} \in
(0, 1)$ to

$$D_P\big(q_{l,h}\, X^{(j)} + (1 - q_{l,h})\, X^{(k)}\big) = \Lambda,$$

for any $j \in S_l$, $k \in S_h$.

Proof Let $Q$ be a distribution with an effective support of size 2, where the events $j, k$
for which $q_j, q_k > 0$ are such that $j \in S_l$ and $k \in S_h$. Furthermore, let $q_j = q$ and
$q_k = 1 - q$. Define $g(q) = g(q; j, k) = D_P\big(q X^{(j)} + (1 - q) X^{(k)}\big)$. Then, $g(q) = D_P(Q)$. We
note that $g(0) = D_P^{S_h} < \Lambda$ and $g(1) = D_P^{S_l} > \Lambda$. Thus a solution, $q^*$, exists in the range
$(0, 1)$. Since $g(q)$ is continuous and convex, there cannot exist another solution in this range.
Let $X = q^* X^{(j)} + (1 - q^*) X^{(k)}$. Let $j' \in S_l$ and $k' \in S_h$. Then, since $D_P(\cdot)$ is 2-symmetric,
$\Lambda = D_P(X) = D_P(t(X, j', j)) = D_P(t(X, k', k)) = D_P(t(t(X, j', j), k', k))$, and thus the
solution is the same for all pairs of members between $S_l$ and $S_h$. ■
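
Because $g$ is convex with $g(0) < \Lambda < g(1)$, the set where $g < \Lambda$ is an interval starting at 0, so the unique crossing point can be located by bisection. The sketch below assumes the KL-divergence measured in bits (for which only $p_j$ and $p_k$ matter when $Q$ is supported on $\{j, k\}$); the function names, the tolerance and the numeric example are choices made for this illustration.

```python
import math


def kl_bits(q, p):
    """D(Q || P) in bits, restricted to the listed events."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def mixing_weight(p_j, p_k, lam, divergence=kl_bits, tol=1e-12):
    """Bisection for q in (0, 1) with D(q X^(j) + (1 - q) X^(k)) = lam.

    Assumes j lies in a level set of L and k in a level set of H,
    so that g(0) < lam < g(1).
    """
    def g(q):
        return divergence((q, 1.0 - q), (p_j, p_k))

    if not (g(0.0) < lam < g(1.0)):
        raise ValueError("need D(X^(k)) < lam < D(X^(j))")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < lam:
            lo = mid   # still inside the region that violates the constraint
        else:
            hi = mid
    return 0.5 * (lo + hi)


q = mixing_weight(0.1, 0.6, 1.5)   # hypothetical probabilities and threshold
print(q, kl_bits((q, 1.0 - q), (0.1, 0.6)))
```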



Therefore, if an adversary-optimal $Q$ has an effective support of size 2, where the events
with non-zero probability are from $S_l$ and $S_h$ respectively, then $\rho(r, Q) = \rho_{l,h} = q_{l,h}\, r_l +
(1 - q_{l,h})\, r_h$.

Therefore, the adversary's choice of an optimal distribution, $Q$, must have one of
$|\mathcal{L}||\mathcal{H}| + |\mathcal{M}|$ (possibly different) rejection rates, a number that is at most quadratic in $K$.
Each of these rates, $\rho_1, \rho_2, \ldots, \rho_{|\mathcal{L}||\mathcal{H}| + |\mathcal{M}|}$, is a linear combination of at most two
variables, $r_i$ and $r_j$. We introduce an additional variable, $z$, to represent the max-min
rejection rate. This entails the following theorem.

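Theorem 23. An optimal soft rejection function and the lower-bound on the optimal type
II error, $1 - z$, is obtained by solving the following linear program:

$$\begin{aligned} \underset{r_1, r_2, \ldots, r_K,\, z}{\text{maximize}} \quad & z \\ \text{subject to:} \quad & \sum_{i=1}^{K} r_i\, |S_i|\, p(S_i) = \delta \qquad (2) \\ & 1 \ge r_1 \ge r_2 \ge \cdots \ge r_K \ge 0 \\ & r_w \ge z \\ & \rho_i \ge z, \quad i \in \{1, 2, \ldots, |\mathcal{L}||\mathcal{H}| + |\mathcal{M}|\}. \end{aligned}$$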

Let $r^*$ be the solution to the linear program (2). Our derivation of the linear program
is dependent on the restriction that $r^*$ isn't vulnerable. If $r^*$ contradicts this restriction
then, as discussed, the optimal strategy is $r^\delta$. The following lemma shows that in this case
$r^* = r^\delta$ anyway, and thus the solution to the linear program is always optimal. Its proof
can be found in Appendix B.

Lemma 24. Let $r^*$ be the solution to the linear program. If $r^*$ is vulnerable, then $r^* = r^\delta$.

Remark 25. We attempted to determine explicit bounds on the value of $1 - z$, the optimal
type II error, that would result from solving the linear program in Theorem 23, including
via examining the dual form of the problem, but were unsuccessful. If the optimal rejection
function $r^* \ne r^\delta$ then one can prove several interesting properties, some of which we have
proven in Lemma 18, which may be of use in determining bounds on the optimal type II
error. However, as the following example illustrates, even determining whether or not the
optimal solution outperforms $r^\delta$ is not trivial.

Example 5. Let $P = \{0.05, 0.05, \ldots, 0.05, 0.2\}$, $\delta = 0.2$, $\Lambda = 3$ and $D_P(\cdot) = D(\cdot\|P)$ be
the KL-divergence. Then, solving the linear program gives $r^\delta$ (it is possible that other
solutions exist, however). Interestingly, changing $\delta$ does not appear to change the result
(even when taking values as small as $\delta = 0.001$, or as large as $\delta = 0.999$). Furthermore, if
we increase $\Lambda$ to 3.2, we achieve solutions to the linear program which aren't $r^\delta$, but do not
improve on its rejection rate (again for the same range of $\delta$ values).
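
A back-of-the-envelope reading of Example 5 is possible under two assumptions that are not stated in the example itself: that the divergence is measured in bits, and that $P$ has sixteen events of probability 0.05 (so that it sums to one). There are then two level sets, with $\delta$-validity reading $0.8\, r_1 + 0.2\, r_2 = \delta$, and the only mixture constraint in the linear program of Theorem 23 is $q_{1,2}\, r_1 + (1 - q_{1,2})\, r_2 \ge z$. Substituting $r_2 = 5\delta - 4 r_1$ turns the left-hand side into $(5 q_{1,2} - 4)\, r_1 + 5 (1 - q_{1,2})\, \delta$, so moving rejection mass toward the low-probability events can help only when $q_{1,2} > 0.8$, whatever the value of $\delta$. Since the divergence of the mixture crosses $\Lambda$ exactly once, this amounts to comparing $\Lambda$ with the divergence of the $q = 0.8$ mixture, which the sketch below evaluates (all names are hypothetical).

```python
import math


def two_point_kl_bits(q, p_low, p_high):
    """D(Q || P) in bits when Q puts mass q on one low-probability event
    and 1 - q on the high-probability event."""
    total = 0.0
    for mass, p in ((q, p_low), (1.0 - q, p_high)):
        if mass > 0:
            total += mass * math.log2(mass / p)
    return total


p_low, p_high = 0.05, 0.2
crossover = two_point_kl_bits(0.8, p_low, p_high)   # divergence of the q = 0.8 mixture


def shifting_rejection_helps(lam, tol=1e-9):
    """True when q_{1,2} > 0.8, i.e. when the q = 0.8 mixture still violates the constraint."""
    return crossover < lam - tol


print(crossover, shifting_rejection_helps(3.0), shifting_rejection_helps(3.2))
```

Under these assumptions the crossover sits at 3.2 bits, which would be consistent with the reported behaviour: at $\Lambda = 3$ nothing beats the constant rule, and at $\Lambda = 3.2$ other solutions tie with it.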
Figure 1: Type II error vs. $\Lambda$, for $N = 50$ and $\delta = 0.05$. 50 distributions were generated
for each value of $\Lambda$ ($\Lambda = 0.5, 0.1, \ldots, 12.5$). Error bars depict standard error of
the mean (SEM).



6.2.1 Numerical Examples 

We numerically compare the performance of hard and soft rejection strategies for a
constrained game, where $D(Q\|P) \ge \Lambda$, for various values of $\Lambda$, and two different families of
target distributions, $P$, over a support of size $N = 50$. The families are arbitrary probability
mass functions over $N$ events and discretized Gaussians (over $N$ bins). For each $\Lambda$ we
generated 50 random distributions $P$ for each of the families. For each such $P$ we solved
the optimal hard and soft strategies and computed the corresponding worst-case optimal
type II error, $1 - \rho(r, Q)$.

Since $\max_Q D(Q\|P) = \log(1 / \min_j p_j)$, it is necessary that $\min_j p_j \le 2^{-\Lambda}$ when generating
$P$ (to ensure that a $\Lambda$-distant $Q$ exists). Distributions in the first family of arbitrarily
random distributions, Figure 6.1(a), are generated by sampling a point ($p_1$) uniformly in
$(0, 2^{-\Lambda}]$. The other $N - 1$ points are drawn i.i.d. $\sim U(0, 1]$, and then normalized so that
their sum is $1 - p_1$. The second family, Figure 6.1(b), are Gaussians centered at 0 and
discretized over $N$ evenly spaced bins in the range $[-10, 10]$. A (discretized) random Gaussian
$N(0, \sigma)$ is selected by choosing $\sigma$ uniformly in some range $[\sigma_{\min}, \sigma_{\max}]$. $\sigma_{\min}$ is set to the
minimum $\sigma$ ensuring that the first/last bin will not have "zero" probability (due to limited
precision). $\sigma_{\max}$ was set so that the cumulative probability in the first/last bin will be $2^{-\Lambda}$,
if possible (otherwise $\sigma_{\max}$ is arbitrarily set to $10 \cdot \sigma_{\min}$).
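
The sampling scheme for the first family can be sketched as follows. The function names, the fixed seed and the use of Python's standard `random` module are assumptions made for this illustration; the original experiments may have been implemented differently.

```python
import math
import random


def sample_arbitrary_pmf(n, lam, rng):
    """Sketch of the first family: p_1 ~ U(0, 2^-lam], the other n - 1 entries
    drawn i.i.d. from U(0, 1] and rescaled so that they sum to 1 - p_1."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    p1 = (2.0 ** -lam) * (1.0 - rng.random())
    rest = [1.0 - rng.random() for _ in range(n - 1)]
    scale = (1.0 - p1) / sum(rest)
    return [p1] + [x * scale for x in rest]


def max_divergence_bits(p):
    """max_Q D(Q || P) = log2(1 / min_j p_j), attained by a point mass."""
    return math.log2(1.0 / min(p))


rng = random.Random(0)
p = sample_arbitrary_pmf(50, 3.0, rng)
print(sum(p), min(p), max_divergence_bits(p))
```

Bounding $p_1$ by $2^{-\Lambda}$ guarantees that at least one point mass is $\Lambda$-distant from $P$, so the constrained adversary always has a feasible strategy.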

The results for $\delta = 0.05$ are shown in Figure 6.1. Other results (not presented) for a wide
variety of the problem parameters (e.g., $N$, $\delta$) are qualitatively the same. It is evident that
both the soft and hard strategies are ineffective for small $\Lambda$. Clearly, the soft method has
significantly lower error than that of the hard (until $\Lambda$ becomes "sufficiently large").

7. Low Density Rejection in a Continuous Setting 

In Section 5 we presented a number of results on LDRS optimality in a simplified finite and
discrete setting. In this section, we reconsider LDRS (now only in the hard setting) in a
much more general framework where the learner and adversary distributions are infinitely
continuous. After defining this general setting we extend Theorem 10 of Section 5 on hard
LDRS optimality. The resulting Theorem 30 is obtained by assuming that the adversary
strategy space is sufficiently large, now satisfying a continuous extension of Property A
called Property $A_{cont}$ (Property C is not required in the continuous setting).

The main contribution of this section is a reduction of the SCC problem to a two-class
classification problem. The two-class classification is facilitated by sampling points from a
synthetically generated "other class." This other class is generated so that it is uniform
over its support, which is appropriately selected around the observed support of $P$. Using
this synthetic sample we obtain a binary training set on which we can train a soft binary
classifier. The final $\delta$-valid SCC classifier is then identified by selecting a threshold on the
classifier output so as to maximize the type I error up to $\delta$. The entire routine is simple and
practical, and if the underlying two-class soft classifier learning algorithm runs in $C(n)$ time
complexity, our SCC algorithm runs in time $O(C(n) + n)$. An alternative approach where
a hard two-class classifier can be used is described by Nisenson (2010).

We show that the SCC routine obtained using this approach is consistent in the sense
that if the underlying classification device is consistent then the resulting one-class classifier
is asymptotically an LDRF, thus providing an optimal SCC solution when the adversary
strategy space satisfies Property $A_{cont}$.

7.1 Definitions 

The SCC problem in the continuous setting is essentially the same as in the finite case (see
Section 2) but now both the source distribution $P$ and the adversary distribution $Q$ can be
infinitely continuous distributions over $\mathbb{R}^d$. Let $\lambda$ be the Lebesgue measure on $\mathbb{R}^d$. We
assume that $P$ is absolutely continuous with respect to $\lambda$ (in other words, if a Borel set
$b$ has zero volume in $\mathbb{R}^d$, then $P(b) = 0$). Denote by $p$ the density function of $P$ and let
$\mathrm{supp}(p)$ be its support in $\mathbb{R}^d$.

We define the function $I_b(x) = I(x \in b)$, where $I(\cdot)$ is the indicator function. For a Borel
set $b$, we define $l_P(b) = b \cup \{x : p(x) = 0\}$.

Definition 26 (Minimum Volume Set). A set $b \subseteq \mathrm{supp}(p)$ is called a minimum volume
set of measure $1 - \delta$ if $P(b) = 1 - \delta$ and for all $b'$ such that $P(b') = P(b) = 1 - \delta$, $\lambda(b) \le \lambda(b')$.

Definition 27 (Low Density Set). 

(i) Let $b \subseteq \mathrm{supp}(p)$ be a minimum volume set of measure $1 - \delta$. Let $m$ be any set such
that $P(m) = \delta$ and $b \cap m = \emptyset$. Then, we call $m$ a core low density set w.r.t. $P$ and $\delta$.

(ii) Denote by $\mathrm{cores}_\delta(P)$ the set of all core low-density sets w.r.t. $P$ and $\delta$.

(iii) We call a set $s$ a low density set w.r.t. $P$ and $\delta$ if there exists an $m \in \mathrm{cores}_\delta(P)$ such
that $s = l_P(m)$.

7.2 LDRS optimality in the continuous setting 

Definition 28 (Low-Density Rejection Strategy (LDRS) and Function (LDRF)).

We define

$$LDRS_\delta(P) = \left\{ r(\cdot) : \exists m \in \mathrm{cores}_\delta(P) \text{ s.t. } r(\cdot) = I_{l_P(m)}(\cdot) \right\}.$$

Any function $r(\cdot) \in LDRS_\delta(P)$ is called a $\delta$-tight Low-Density Rejection Function (LDRF),
and the Low-Density Rejection Strategy is to choose any $\delta$-tight LDRF.
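
To make Definitions 26 to 28 concrete, the following sketch computes a $\delta$-tight low-density rejection region for a one-dimensional, piecewise-constant density. It is a toy illustration under stated assumptions: the density is given as per-bin values with bin widths (both hypothetical inputs), the masses sum to one, and a bin may be rejected only partially, which corresponds to rejecting a sub-interval of that bin.

```python
def low_density_rejection(densities, widths, delta):
    """Return, per bin, the fraction of the bin's volume that is rejected.

    Zero-density bins are always rejected (the l_P(.) closure); the remaining
    bins are filled in increasing order of density until mass delta is rejected.
    """
    masses = [p * w for p, w in zip(densities, widths)]
    reject = [1.0 if p == 0.0 else 0.0 for p in densities]
    remaining = delta
    for i in sorted(range(len(densities)), key=lambda i: densities[i]):
        if densities[i] == 0.0 or remaining <= 0.0:
            continue
        take = min(masses[i], remaining)   # mass rejected inside bin i
        reject[i] = take / masses[i]
        remaining -= take
    return reject
```

When several bins share the same density, any way of splitting the rejected mass among them yields a valid core low density set, which is why $\mathrm{cores}_\delta(P)$ is a family of sets rather than a single set.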

Definition 29 (Property $A_{cont}$). We say that two Borel sets $j, k$ satisfy condition (*) if:
(i) $j, k \subseteq \mathrm{supp}(p)$; (ii) $j \cap k = \emptyset$; (iii) $P(j) = P(k)$; and (iv) $\lambda(j) > \lambda(k)$.

An adversary strategy space $\mathcal{Q}$ has Property $A_{cont}$ w.r.t. $P$, if for every pair $j, k$ satisfying
(*): $\forall Q \in \mathcal{Q}$ such that $Q(j) < Q(k)$, $\exists Q' \in \mathcal{Q}$, for which

1. $Q'(j) + Q(j) > Q'(k) + Q(k)$;

2. For all Borel sets $b$ for which $b \cap (j \cup k) = \emptyset$, $Q'(b) = Q(b)$.

The proof of the following theorem can be found in the appendix. 

Theorem 30. When the learner is restricted to hard decisions and $\mathcal{Q}$ satisfies Property $A_{cont}$
w.r.t. $P$, then LDRS is optimal.

7.3 SCC via Two-Class Classification

We propose an SCC routine that relies on a soft binary classifier induction. We can use
any two-class algorithm, which is consistent in the sense that it minimizes a loss function
$\phi(\cdot)$ that is non-negative, differentiable, convex, strictly convex over $[-\infty, 0)$ and satisfies
$\phi'(0) < 0$. These conditions are similar to but stronger than the conditions required by
Bartlett, Jordan, and McAuliffe (2006), which provide necessary and sufficient conditions
for a convex $\phi$ to be classification-calibrated.$^7$ We note however that the commonly used
loss functions as discussed in Bartlett et al. (2006) satisfy our conditions, including the
quadratic, truncated-quadratic, exponential and logistic loss functions, to name a few. In
the extensions to this section (see Nisenson, 2010) an SCC routine is presented that can
utilize any hard binary classifier induction algorithm that minimizes either the 0/1, $L_1$, or
hinge loss functions, as well as any of the loss functions defined by Bartlett et al. (2006).$^8$

Our SCC algorithm is given a training sample $S_n = \{x_1, \ldots, x_n\}$ of $n$ training examples
drawn i.i.d. from an unknown source distribution $P$ over $\mathbb{R}^d$. Given a type-I threshold $\delta$ the
algorithm outputs a hard rejection function $r(\cdot)$ over $\mathbb{R}^d$. The main idea of the algorithm
is based on the following observation. If our domain is bounded, we can define a two-class
classification problem where the first class is $P$ and the other class is a uniform distribution
over the (bounded) domain. Then, the output of a consistent soft binary classifier is strictly
monotonically increasing with $p(\cdot)$ (the density of $P$) over the support of $P$ (it is only weakly
monotone in $p(\cdot)$ over the whole domain). Therefore, thresholding the classifier's output,
with an appropriate quantile, identifies a $\delta$-valid level-set in $P$, inducing a rejection function.
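
The monotonicity claim can be checked numerically for one admissible loss. The sketch below assumes the logistic loss $\phi(y) = \log(1 + e^{-y})$ and equal class priors, and solves the pointwise first-order condition $p(x)\phi'(y) = u\,\phi'(-y)$ by bisection, where $u$ is the (constant) density of the uniform class. The function names are illustrative only.

```python
import math

def logistic_loss_grad(y):
    """Derivative of phi(y) = log(1 + exp(-y))."""
    return -1.0 / (1.0 + math.exp(y))

def optimal_soft_output(p, u, lo=-50.0, hi=50.0, iters=200):
    """Minimizer y of p*phi(y) + u*phi(-y), found by bisection on its derivative."""
    def g(y):
        # g is increasing in y, negative for very small y and positive for large y
        return p * logistic_loss_grad(y) - u * logistic_loss_grad(-y)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the logistic loss the minimizer has the closed form $\log(p(x)/u)$, so with $u$ fixed the optimal soft output is a strictly increasing function of the density $p(x)$ wherever $p(x) > 0$.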

In practice, sampling from a uniform distribution over large domains is computationally
hard and moreover, undefined for unbounded domains. Our algorithm avoids these
obstacles by sampling uniformly in grid cells containing sampled points from $P$. An
additional complication arises in cases where the density $p$ is flat over some regions, which
results in discontinuities of the level sets. This is a known issue in level set estimation and
is often avoided by assuming that there are no flat regions in $p$, in particular in regions
corresponding to the $\delta$ level set (Tsybakov, 1997; Molchanov, 1990). We don't assume this;
our algorithm handles flat regions in $p$ by jittering the classifier output using a small and
vanishing (in $n$) random noise (see step 6 in the algorithm below). The resulting algorithm
is computationally efficient and practical.

7. Our additional conditions are differentiability everywhere and strict convexity over $[-\infty, 0)$. The reason
for these extra conditions is that we threshold the soft classifier's output and don't merely use its sign
for classification.

8. The use of a hard classifier (as opposed to a soft one) results in a time complexity penalty of a factor of
$O(\log n)$.

A major component of our algorithm is determining a threshold by quantile estimation.
This occurs in Step 7 of the algorithm. We apply a known estimator (Uhlmann, 1963;
Zieliński, 2004) that is unbiased and has certain optimal characteristics (see below). This
quantile estimator assumes that the cumulative distribution function (cdf), $F$, underlying
the sample, is continuous, and is defined over $\mathbb{R}$ (i.e., $F$ is the cdf of a real random variable).
Let $t_\mu$ be the estimate of the $\mu$-quantile of $F$, given $n$ sample points drawn i.i.d. according to
$F$. The estimator is unbiased if $E_F[F(t_\mu)] = \mu$. Its variance is $\mathrm{Var}_F[F(t_\mu)]$. The estimator
we use is called the "uniformly minimum variance unbiased estimator." It was introduced
by Uhlmann (1963) and we rely on analysis by Zieliński (2004). This estimator can only be
used for estimating $\mu$-quantiles that satisfy $\frac{1}{n+1} \le \mu \le \frac{n}{n+1}$, which is equivalent to requiring
that $n \ge \max\left\{\frac{\mu}{1-\mu}, \frac{1-\mu}{\mu}\right\}$. The estimator chooses an index $\pi_\mu$ in $[1, \ldots, n]$, and the estimate
of the $\mu$-quantile is the $\pi_\mu$-th order statistic; in other words, if our sample points are sorted
in increasing order, then the estimate is the $\pi_\mu$-th element. $\pi_\mu$ is calculated as follows:

• Set $k = \lfloor (n + 1)\mu \rfloor$.

• Set $\beta = (n + 1)\mu - k$.

• With probability $\beta$, set $\pi_\mu = k + 1$, and with probability $1 - \beta$, set $\pi_\mu = k$.

The estimator's variance is (Zieliński, 2004):

$$\frac{\beta(1-\beta)}{(n+1)(n+2)} + \frac{\mu(1-\mu)}{n+2}.$$

The variance is maximized when $\beta = \mu = \frac{1}{2}$, and thus the variance is at most $\frac{1}{4(n+1)}$.
Moreover, according to Zieliński (2004), the estimator is unbiased and its variance is not
greater than that of any other unbiased estimator within the family of estimators that can
be defined using a probability distribution over single order statistics. For very small
samples with $n < \max\left\{\frac{\mu}{1-\mu}, \frac{1-\mu}{\mu}\right\}$ we "fall back" to a simple "default" estimator, which sets
$\pi_\mu = \lceil n\mu \rceil$. We term this quantile-estimation algorithm the "uniformly minimum variance
unbiased (with fall-back) estimator," or the "UMVUFB estimator."
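
The index selection rule above is short enough to write out directly. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, the fall-back is taken to be $\lceil n\mu \rceil$ (an assumption about the rounding direction), and the result is clamped to a valid index.

```python
import math
import random

def umvufb_index(n, mu, rng):
    """1-based index pi_mu of the order statistic used as the mu-quantile estimate."""
    target = (n + 1) * mu
    if 1.0 <= target <= n:                  # UMVU regime: 1/(n+1) <= mu <= n/(n+1)
        k = math.floor(target)
        beta = target - k                   # probability of rounding the index up
        return k + 1 if rng.random() < beta else k
    return min(max(math.ceil(n * mu), 1), n)    # fall-back for very small samples

def umvufb_quantile(sample, mu, rng):
    """Estimate the mu-quantile as the pi_mu-th smallest sample point."""
    ordered = sorted(sample)
    return ordered[umvufb_index(len(ordered), mu, rng) - 1]
```

Randomizing between the $k$-th and $(k+1)$-th order statistics is what makes $E_F[F(t_\mu)]$ equal $\mu$ exactly, since the $j$-th order statistic of a continuous sample satisfies $E[F(X_{(j)})] = j/(n+1)$.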
The algorithm is as follows: 

1. Define a grid over $\mathbb{R}^d$ with arbitrary origin and positive cell side length $g_n$. Let
$g_n \to 0$ be such that $n g_n^d \to \infty$. For example, $g_n = n^{-\frac{1}{d+2}}$. Select an arbitrary
origin $x_0$, for example, uniformly at random from the unit hypercube. For any point
$x = (x^{(1)}, \ldots, x^{(d)})$, define the function

$$A_n(x) = x_0 + g_n \left( \left\lfloor \frac{x^{(1)} - x_0^{(1)}}{g_n} \right\rfloor, \ldots, \left\lfloor \frac{x^{(d)} - x_0^{(d)}}{g_n} \right\rfloor \right).$$

For each point $x$, $A_n(x)$ specifies the coordinates of the "lower left" corner of the grid
cell containing $x$.

2. Define the set $G_P = \bigcup_{x \in S_n} \{A_n(x)\}$ of covered grid cell corners.

3. Generate an artificial sample $O_n$ of size $n$ from the "other class." Each point is selected
independently at random as follows:

(a) Choose $a \in G_P$ uniformly at random.

(b) Choose a point $x$ uniformly at random from the unit hypercube.

(c) The new artificial sample point is $o = a + g_n \cdot x$.

4. Using the training sample consisting of $S_n$ (labeled $+1$) and $O_n$ (labeled $-1$), train a
soft binary classifier $h_n(\cdot)$.

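5. Define a confidence margin for the $\delta$ threshold. Select any $\theta_n \to \infty$ such that $\theta_n =
o(\sqrt{n})$, for example, a root of $n$ of order higher than two. Now define $\delta_n^+ = \delta + \frac{\theta_n}{\sqrt{n}}$.
Choose $\delta_n^- \le \delta - \frac{\theta_n}{\sqrt{n}}$ such that $\delta_n^- \to \delta$.

6. Jitter the classifier output. Let $X_P$ be a random variable where $X_P \sim P$ and
$Y_n = h_n(X_P)$. Let $\Phi(\cdot)$ be the cumulative distribution function of $N(0, 1)$, and let $m_n$
be such that $\frac{\Phi(-m_n)}{2\Phi(m_n)} = o\left(\frac{\theta_n}{\sqrt{n}}\right)$, for example $m_n = e^{n}$. Let $\sigma_n = o\left(\frac{1}{m_n}\right)$, for example,
$\sigma_n = \frac{1}{m_n^2} = e^{-2n}$. Let $\epsilon \sim N[0, \sigma_n^2]$, and set $Z_n = Y_n + \epsilon$.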

7. We use the following threshold mechanism. We will select two thresholds $t_n^-$ and $t_n^+$
on $Z_n$. The cutoff is always $t_n^-$ and it is inclusive when $t_n^- < t_n^+$. Specifically, let $t_n^-$
and $t_n^+$ be estimates of the $\left(\Phi(m_n)\delta_n^- + \frac{\Phi(-m_n)}{2}\right)$-quantile and the
$\left(\Phi(m_n)\delta_n^+ + \frac{\Phi(-m_n)}{2}\right)$-quantile of $Z_n$, respectively. In order to establish these
estimates we require a sample from $Z_n$. The following procedure produces a list of sample points $S_Z$.

• Set $S_Z = []$, i.e. $S_Z$ is an empty list.

• For each $x \in S_n$: choose a value $\epsilon \sim N[0, \sigma_n^2]$ and append the value $h_n(x) + \epsilon$
onto $S_Z$.

The sample $S_Z$ is then the input to the UMVUFB estimator defined above.






8. Define the rejection function

$$r_n(x) = \begin{cases} 1 & A_n(x) \notin G_P; \\ I(h_n(x) \le t_n^-) & A_n(x) \in G_P \text{ and } t_n^- < t_n^+; \\ I(h_n(x) < t_n^-) & \text{otherwise.} \end{cases}$$
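
The eight steps can be strung together in a compact end-to-end sketch. Everything below is illustrative and rests on explicit assumptions: a k-nearest-neighbour label average stands in for the soft classifier (any consistent soft classifier could be plugged in), $\theta_n = \log n$ is used with the margin capped at $\delta/2$ so that small samples remain usable, the jitter scale `sigma` is a fixed tiny constant, and the factors $\Phi(m_n) \approx 1$ and $\Phi(-m_n) \approx 0$ are dropped from the quantile levels. NumPy is assumed to be available.

```python
import numpy as np

def cell_index(X, x0, g):
    """Integer grid-cell index of each row; A_n(x) = x0 + g * cell_index(x)."""
    return np.floor((np.atleast_2d(X) - x0) / g).astype(int)

def knn_soft_classifier(train_X, train_y, k):
    """Stand-in soft classifier (an assumption): mean label of the k nearest points."""
    def h(X):
        X = np.atleast_2d(X)
        d2 = ((X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
        nearest = np.argsort(d2, axis=1)[:, :k]
        return train_y[nearest].mean(axis=1)
    return h

def order_statistic_quantile(values, mu, rng):
    """Randomized single-order-statistic estimate of the mu-quantile."""
    n = len(values)
    k = int(np.floor((n + 1) * mu))
    beta = (n + 1) * mu - k
    index = k + 1 if rng.random() < beta else k
    return np.sort(values)[min(max(index, 1), n) - 1]

def fit_scc(S, delta, rng, k=15, sigma=1e-9):
    n, d = S.shape
    g = n ** (-1.0 / (d + 2))                          # Step 1: cell side length
    x0 = rng.random(d)                                 # Step 1: random origin
    cells = np.unique(cell_index(S, x0, g), axis=0)    # Step 2: occupied cells (G_P)
    picks = cells[rng.integers(len(cells), size=n)]    # Step 3(a)
    O = x0 + g * (picks + rng.random((n, d)))          # Step 3(b)-(c)
    X = np.vstack([S, O])                              # Step 4: binary training set
    y = np.concatenate([np.ones(n), -np.ones(n)])
    h = knn_soft_classifier(X, y, k)
    margin = min(np.log(n) / np.sqrt(n), delta / 2)    # Step 5 (capped, see lead-in)
    scores = h(S) + rng.normal(0.0, sigma, size=n)     # Steps 6-7: jittered sample S_Z
    t_minus = order_statistic_quantile(scores, delta - margin, rng)
    t_plus = order_statistic_quantile(scores, delta + margin, rng)
    occupied = {tuple(c) for c in cells.tolist()}

    def reject(X_new):                                 # Step 8: rejection function
        idx = cell_index(X_new, x0, g).tolist()
        inside = np.array([tuple(c) in occupied for c in idx])
        s = h(X_new)
        low = (s <= t_minus) if t_minus < t_plus else (s < t_minus)
        return np.where(inside, low, True)
    return reject
```

Points whose grid cell contains no training point are rejected outright, so the classifier is only ever consulted inside the estimated support. Because the k-NN scores take few distinct values, the empirical rejection rate on the training sample can exceed $\delta_n^-$; this is the finite-sample jump effect discussed in Remark 41.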

Remark 31. Instead of a soft classifier, $h_n(\cdot)$ could have been any consistent class-probability
estimator, where $h_n(x)$ is the estimate of $\Pr\{+1 \mid x\}$. See (Nisenson, 2010) for details. $h_n(\cdot)$
could also be a consistent ranking algorithm (see, e.g., Clémençon, Lugosi, & Vayatis, 2005).
In this case, the quantile estimator must select a single sample point to represent the
quantile. All comparison operations (e.g. $<$, $\le$), including those done by the quantile estimator,
must be performed by the ranking algorithm. The ranking algorithm must also be able to
distinguish between $t_n^- < t_n^+$ and $t_n^- = t_n^+$.

Let $U_n$ (with density $u_n$) be the distribution of $O_n$ (defined in Step 3). Clearly, $U_n$ is
uniform over its bounded support. As previously noted, if the support of the generated
distribution is significantly larger than that of $P$, an exorbitant number of points may need
to be generated in practice in order to reject low density areas in $P$ (Davenport et al., 2006).
The following lemma shows that the probability of generating points outside of $p$'s support
almost surely tends to zero.

Lemma 32. $U_n(\mathbb{R}^d \setminus \mathrm{supp}(p)) \xrightarrow{a.s.} 0$.

Proof. Recall that $g_n$ is a sequence of positive numbers such that $\lim_{n\to\infty} n g_n^d = \infty$ and
$\lim_{n\to\infty} g_n = 0$. Define a sequence of positive numbers $g'_n$, such that $g'_n \ge 2g_n$, $g'_n \to 0$
and $\lim_{n\to\infty} n (g'_n)^d = \infty$. Define $A(x, g'_n) = \{y : y \in \mathbb{R}^d \text{ and } \|x - y\|_\infty \le g'_n\}$. Define
$T_n = \bigcup_{i=1}^{n} A(x_i, g'_n)$. Devroye and Wise (1980) show that for any probability measure $\nu$ on
the Borel sets of $\mathbb{R}^d$ whose restriction to $\mathrm{supp}(p)$ is absolutely continuous w.r.t. $P$, it holds
that $\nu(T_n \,\triangle\, \mathrm{supp}(p)) \to 0$. We note that the grid cell of $x$ is always a sub-region of $A(x, g'_n)$,
and therefore $\mathrm{supp}(u_n) \subseteq T_n$. Thus, noting that $U_n(\mathbb{R}^d \setminus \mathrm{supp}(p)) = U_n(\mathrm{supp}(u_n) \setminus \mathrm{supp}(p))$,

$$U_n(\mathrm{supp}(u_n) \setminus \mathrm{supp}(p)) \le \sup_m U_m(\mathrm{supp}(u_n) \setminus \mathrm{supp}(p)) \le \sup_m U_m(T_n \setminus \mathrm{supp}(p)) \le \sup_m U_m(T_n \,\triangle\, \mathrm{supp}(p)),$$

$$\lim_{n\to\infty} U_n(\mathrm{supp}(u_n) \setminus \mathrm{supp}(p)) \le \lim_{n\to\infty} \sup_m U_m(T_n \,\triangle\, \mathrm{supp}(p)) = \sup_m \lim_{n\to\infty} U_m(T_n \,\triangle\, \mathrm{supp}(p)) \overset{a.s.}{=} 0.$$

Remark 33. It is difficult to establish exact convergence rates in Lemma 32 without
constraints on $P$. For cases where $\lambda(\mathbb{R}^d \setminus \mathrm{supp}(p)) = 0$, we obviously have that
$U_n(\mathbb{R}^d \setminus \mathrm{supp}(p)) = 0$. This is the case, for example, for finite mixtures of Gaussians.

If there exists a constant $K$, such that $|p(x) - p(y)| \le K$ whenever $\|x - y\|_\infty \le g_n$, we
can establish an upper bound on the rate. The condition $\|x - y\|_\infty \le g_n$ holds whenever $x$
and $y$ are in the same grid cell. Therefore, if $p(x) > K$, then for all $y$ in the same grid
cell, $p(y) > 0$. Note that if $p$ is Lipschitz continuous such that $|p(x) - p(y)| \le \frac{K}{g_n}\|x - y\|_\infty$,
then $p$ meets the above condition. Let $1 - \eta$ be a desired confidence level. Let $c_K$ be the
number of cells in the grid which contain more than $K n g_n^d + \sqrt{\frac{n}{2}\ln\frac{1}{\eta}}$ sample points.
Then, using Hoeffding's inequality, it isn't hard to show that with probability at least $1 - \eta$,
$U_n(\mathbb{R}^d \setminus \mathrm{supp}(p)) \le 1 - \frac{c_K}{|G_P|}$.

Definition 34 (Quantile). Let $\xi$ be a random variable whose domain is in $\mathbb{R}$. We say
that $t$ is a $\mu$-quantile of $\xi$ if

$$t \in S_\xi(\mu) = \{\tau \in \mathbb{R} : \Pr\{\xi < \tau\} \le \mu \text{ and } \Pr\{\xi \le \tau\} \ge \mu\}.$$

Define a new random variable $\kappa$ to represent the level sets of $P$. Formally, its cumulative
distribution function is $F_\kappa(t) = P(\{x : p(x) \le t\})$.

Definition 35. Let $v$ be any $\delta$-quantile of $\kappa$. We say $P$ has a $\delta$-jump if $F_\kappa(v) > \delta$.
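
A $\delta$-jump arises exactly when the density is flat on a region whose mass straddles $\delta$. The sketch below illustrates Definitions 34 and 35 for a piecewise-constant density, described (as an assumption of this example) by a list of `(density value, probability mass)` pairs; it uses the smallest $\delta$-quantile of $\kappa$ as $v$.

```python
def level_cdf(levels, t, strict=False):
    """P(p(X) <= t), or P(p(X) < t) when strict, for a piecewise-constant density."""
    return sum(m for v, m in levels if (v < t if strict else v <= t))

def smallest_quantile(levels, delta):
    """Smallest tau with Pr{kappa < tau} <= delta <= Pr{kappa <= tau}."""
    for v, _ in sorted(levels):
        if level_cdf(levels, v, strict=True) <= delta <= level_cdf(levels, v):
            return v
    raise ValueError("delta must lie in (0, 1] and the masses must sum to 1")

def has_delta_jump(levels, delta):
    """True when F_kappa(v) > delta at the smallest delta-quantile v."""
    return level_cdf(levels, smallest_quantile(levels, delta)) > delta
```

For a density equal to 0.25 on a region of mass 0.25 and 0.75 on a region of mass 0.75, `has_delta_jump` is true for $\delta = 0.1$ (the whole flat region of mass 0.25 sits at the quantile) and false for $\delta = 0.25$.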





Figure 2: The cumulative distribution function $F_\kappa$ when $P$ doesn't have a $\delta$-jump (panels (a) and (b)).

Figure 3: The cumulative distribution function $F_\kappa$ when $P$ does have a $\delta$-jump.

We now will consider two cases, one where $P$ doesn't have a $\delta$-jump and one where it
does. See Figure 2 and Figure 3. In all figures the (unique) $\delta$-quantile of $\kappa$ is marked by
$v$. Note that a quantile need not be unique; in particular there will be a range of values
wherever $F_\kappa$ is flat.

Definition 36. A rejection function $r(\cdot)$ is called a $\delta$-maximal level-set estimator for $P$ if,
for some $v \in S_\kappa(\delta)$, either:

1. $P$ doesn't have a $\delta$-jump and $r(x) = I(p(x) \le v)$, almost everywhere.

2. $P$ has a $\delta$-jump and $r(x) = I(p(x) < v)$, almost everywhere.

Note that if $P$ doesn't have a $\delta$-jump, then a $\delta$-maximal level-set estimator for $P$ is a
$\delta$-tight LDRF. We will now prove that the output of the algorithm is asymptotically (almost
surely) a $\delta$-maximal level-set estimator for $P$.

Theorem 37. Let $\{U'_n\}$, $n = 1, 2, \ldots$, be a sequence of probability measures such that for
each $n$, $U'_n$ has uniform density $u'_n$ over its bounded support, and $\lim_{n\to\infty} P(\mathrm{supp}(u'_n)) = 1$.
Define a Bayesian binary classification problem for each $n$. Let the first class, $c_1 = +1$,
have distribution $P$, and the second class, $c_2 = -1$, have distribution $U'_n$. The classes' prior
probabilities are $\Pr\{+1\} = \Pr\{-1\} = \frac{1}{2}$. Let $\phi(\cdot)$ be a non-negative, differentiable, convex
loss function such that it is strictly convex on $[-\infty, 0)$ and $\phi'(0) < 0$. Let $h_n^*(\cdot)$ be the
soft Bayes-optimal classifier that minimizes the expected loss. Define a random variable
$Y_n^* = h_n^*(X_P)$. Let $t_n^*$ be a $\delta$-quantile of $Y_n^*$. Define the rejection function:

$$r_n^*(x) = \begin{cases} 1 & x \notin \mathrm{supp}(u'_n); \\ I(h_n^*(x) \le t_n^*) & x \in \mathrm{supp}(u'_n) \text{ and } P \text{ doesn't have a } \delta\text{-jump}; \\ I(h_n^*(x) < t_n^*) & \text{otherwise.} \end{cases}$$

Then, $r^*(\cdot) = \lim_{n\to\infty} r_n^*(\cdot)$ is a $\delta$-maximal level-set estimator for $P$.

Proof. We first consider $x \in \mathrm{supp}(u'_n)$. Define the function $\psi_n(x) = \frac{p(x)}{u'_n(x)}$, defined over
$\mathrm{supp}(u'_n)$. From Bayes' theorem, it is not hard to show that $\Pr\{+1 \mid x\} = \frac{p(x)}{p(x) + u'_n(x)} = \frac{\psi_n(x)}{1 + \psi_n(x)}$. The
loss for a point $x$ when we assign it value $y$ is (Bartlett et al., 2006):

$$\ell(x, y) = \Pr\{+1 \mid x\}\phi(y) + \Pr\{-1 \mid x\}\phi(-y) = \frac{p(x)\phi(y) + u'_n(x)\phi(-y)}{p(x) + u'_n(x)}.$$

It is easy to verify that for a fixed $x$, at the minimum (over $y$) of $\ell(x, y)$, $p(x)\phi'(y) =
u'_n(x)\phi'(-y)$. Alternatively: $\phi'(-y) = \psi_n(x)\phi'(y)$. Let $x_1$ and $x_2$ be two points such that
$\psi_n(x_1) > \psi_n(x_2)$. Note that $\min\{\phi'(y), \phi'(-y)\} \le \phi'(0) < 0$ for all $y$. Let $c_i = \psi_n(x_i)$ and $y_i$
be a solution to $\phi'(-y) = c_i\phi'(y)$, for $i \in \{1, 2\}$. Note that $c_1, c_2 \ge 0$ and therefore, in order
for equality to occur it is necessary that $\phi'(y_i), \phi'(-y_i) \le 0$ (with equality only if $c_i = 0$).
We can now rewrite $\phi'(-y_i) = c_i\phi'(y_i)$ as $|\phi'(-y_i)| = c_i|\phi'(y_i)|$.

We will now prove that $y_1 \ge y_2$. Assume by contradiction that the statement is false.
Then $y_2 > y_1$. Therefore, $|\phi'(y_2)| \le |\phi'(y_1)|$ and $|\phi'(-y_2)| \ge |\phi'(-y_1)|$. Since $\psi_n(x_1) >
\psi_n(x_2)$, it follows that $c_1 > c_2$. If $c_2 = 0$, then $0 = |\phi'(-y_2)| \ge |\phi'(-y_1)|$. Therefore
$\phi'(-y_1) = 0$ and $\phi'(y_1) < 0$, which gives $0 = |\phi'(-y_1)| = c_1|\phi'(y_1)| > 0$, which is a
contradiction. Thus, $c_2 \ne 0$, and $|\phi'(-y_2)| = c_2|\phi'(y_2)| \le c_2|\phi'(y_1)| = \frac{c_2}{c_1}|\phi'(-y_1)| \le
\frac{c_2}{c_1}|\phi'(-y_2)| < |\phi'(-y_2)|$. Contradiction.

Now consider the case where $c_i = \frac{|\phi'(-y_i)|}{|\phi'(y_i)|} > 0$. Therefore, $\phi'(y_i), \phi'(-y_i) < 0$. Since $\phi$
is strictly convex over $[-\infty, 0)$ it follows that as $y_i$ increases $|\phi'(y_i)|$ decreases and $|\phi'(-y_i)|$
increases. Therefore, if $\psi_n(x_1) = \psi_n(x_2) > 0$, there is a unique solution.

Therefore, $h_n^*(x)$ is monotonically increasing with $\psi_n(x)$, almost everywhere over $\mathrm{supp}(u'_n)$,
and strictly monotonically increasing with $\psi_n(x)$, almost everywhere over $\mathrm{supp}(u'_n) \cap \mathrm{supp}(p)$.
Since $u'_n(\cdot)$ is constant over its support, this implies: $p(x_1) < p(x_2) \Rightarrow h_n^*(x_1) < h_n^*(x_2)$, and
$0 < p(x_1) = p(x_2) \Rightarrow h_n^*(x_1) = h_n^*(x_2)$. Therefore, for some $v_n^*$, $\{x \in \mathrm{supp}(u'_n) \cap \mathrm{supp}(p) :
h_n^*(x) \le t_n^*\}$ is identical to $\{x \in \mathrm{supp}(u'_n) \cap \mathrm{supp}(p) : p(x) \le v_n^*\}$ (with the possible
exception of a set of points of zero Lebesgue measure). Recalling that $Y_n^* = h_n^*(X_P)$ for $X_P \sim P$
and that $P(\mathrm{supp}(u'_n)) \to 1$:

$$\begin{aligned}
\lim_{n\to\infty} t_n^* &\in \lim_{n\to\infty} \{\tau \in \mathbb{R} : \Pr\{Y_n^* < \tau\} \le \delta \text{ and } \Pr\{Y_n^* \le \tau\} \ge \delta\} \\
&= \lim_{n\to\infty} \{\tau \in \mathbb{R} : P(\{x : h_n^*(x) < \tau\}) \le \delta \text{ and } P(\{x : h_n^*(x) \le \tau\}) \ge \delta\} \\
&= \lim_{n\to\infty} \{\tau \in \mathbb{R} : P(\{x \in \mathrm{supp}(u'_n) \cap \mathrm{supp}(p) : h_n^*(x) < \tau\}) \le \delta \text{ and } \\
&\qquad\qquad P(\{x \in \mathrm{supp}(u'_n) \cap \mathrm{supp}(p) : h_n^*(x) \le \tau\}) \ge \delta\} \\
\Rightarrow \lim_{n\to\infty} v_n^* &\in \lim_{n\to\infty} \{\tau' \in \mathbb{R} : P(\{x \in \mathrm{supp}(u'_n) \cap \mathrm{supp}(p) : p(x) < \tau'\}) \le \delta \text{ and } \\
&\qquad\qquad P(\{x \in \mathrm{supp}(u'_n) \cap \mathrm{supp}(p) : p(x) \le \tau'\}) \ge \delta\} \\
&= \lim_{n\to\infty} \{\tau' \in \mathbb{R} : P(\{x : p(x) < \tau'\}) \le \delta \text{ and } P(\{x : p(x) \le \tau'\}) \ge \delta\} \\
&= \{\tau' \in \mathbb{R} : P(\{x : p(x) < \tau'\}) \le \delta \text{ and } P(\{x : p(x) \le \tau'\}) \ge \delta\} \\
&= \{\tau' \in \mathbb{R} : \Pr\{\kappa < \tau'\} \le \delta \text{ and } \Pr\{\kappa \le \tau'\} \ge \delta\}.
\end{aligned}$$

Therefore, let $v_P \in S_\kappa(\delta)$ be such that $v_P = \lim_{n\to\infty} v_n^*$. Note that since $\delta > 0$, $v_P > 0$
(otherwise $\delta \le P(\{x : p(x) \le v_P\}) = 0$). Therefore, for sufficiently large $n$, $v_n^* > 0$.

Let us assume that $P$ doesn't have a $\delta$-jump. Therefore, for almost every $x \in \mathrm{supp}(u'_n)$,
$I(h_n^*(x) \le t_n^*) = I(p(x) \le v_n^*)$. Then almost everywhere in $\mathrm{supp}(u'_n)$: $r^*(x) = \lim_{n\to\infty} I(h_n^*(x) \le
t_n^*) = I(p(x) \le v_P)$. It is given that $P(\mathbb{R}^d \setminus \mathrm{supp}(u'_n)) \to 0$. Therefore, $\lambda(\{x \notin \mathrm{supp}(u'_n) :
p(x) > v_P\}) \to 0$, which is equivalent to $\lambda(\{x \notin \mathrm{supp}(u'_n) : I(p(x) \le v_P) \ne r_n^*(x)\}) \to 0$.

If $P$ has a $\delta$-jump, the proof is almost identical, only with minor changes in the strengths
of inequalities. For almost every $x \in \mathrm{supp}(u'_n)$: $r^*(x) = \lim_{n\to\infty} I(h_n^*(x) < t_n^*) = I(p(x) <
v_P)$, and $\lambda(\{x \notin \mathrm{supp}(u'_n) : I(p(x) < v_P) \ne r^*(x)\}) \to 0$. ■



We will now make clear the relation between the algorithm given and Theorem 37.
Clearly $\{U_n\}$ is a sequence of distributions each having a uniform density, $u_n$, over its bounded
support. We will now prove that $P(\mathrm{supp}(u_n)) \xrightarrow{a.s.} 1$.

Lemma 38. For any $\epsilon > 0$, $\Pr\{P(\mathrm{supp}(u_n)) < 1 - \epsilon\} \le 2e^{-2n\epsilon^2 - n\,o(1)}$ and $P(\mathrm{supp}(u_n)) \xrightarrow{a.s.} 1$.

Proof. We define $G(x)$ to be the cell in the grid which contains $x$. Define $c(b) = |S_n \cap b|$ to
be the count of the number of training samples which fall within set $b$. Then the histogram
density estimate is $p_n(x) = \frac{c(G(x))}{n g_n^d}$. As shown by Devroye and Györfi (2002) in Theorem 5.6,
$\Pr\left\{\int_{\mathbb{R}^d} |p(x) - p_n(x)|\lambda(dx) > 2\epsilon\right\} \le 2e^{-2n\epsilon^2 - n\,o(1)}$. However, since $P$ is absolutely
continuous w.r.t. $\lambda$, it follows from Scheffé's theorem (Scheffé (1947), used as Theorem 5.4 by
Devroye and Györfi (2002)), that for any Borel set $B$ over $\mathbb{R}^d$,
$\Pr\left\{\int_B |p(x) - p_n(x)|\lambda(dx) > \epsilon\right\} \le 2e^{-2n\epsilon^2 - n\,o(1)}$.
By definition, $p_n(x) = 0$ for all $x \notin \mathrm{supp}(u_n)$. Therefore:

$$\Pr\left\{P(\mathbb{R}^d \setminus \mathrm{supp}(u_n)) > \epsilon\right\} = \Pr\left\{\int_{\mathbb{R}^d \setminus \mathrm{supp}(u_n)} |p(x) - p_n(x)|\lambda(dx) > \epsilon\right\} \le 2e^{-2n\epsilon^2 - n\,o(1)}.$$

Since this is true for any $\epsilon$, it immediately follows that $\Pr\{\lim_{n\to\infty} P(\mathbb{R}^d \setminus \mathrm{supp}(u_n)) \ne 0\} =
0$, or $P(\mathrm{supp}(u_n)) \xrightarrow{a.s.} 1$. ■

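Therefore, the only remaining part is to show how $t_n^-$ and $t_n^+$ relate to $t_n^*$ and to whether $P$
has a $\delta$-jump or not. We note that for all $x$, at the limit, $h_n^*(x) = h_n(x)$ and therefore, $Y_n^* =
Y_n$ (i.e. they are distributed identically). For sufficiently large $n$, the quantile estimator
used is unbiased with standard deviation vanishing at a rate of $O\left(\frac{1}{\sqrt{n+1}}\right)$ (Zieliński, 2004).
Therefore, since $\theta_n \to \infty$ and $\theta_n = o(\sqrt{n})$, the margin $\frac{\theta_n}{\sqrt{n}}$ vanishes while dominating this
standard deviation, and thus for sufficiently large $n$,
$t_n^-$ and $t_n^+$ are tightly concentrated around a $\left(\Phi(m_n)\delta_n^- + \frac{\Phi(-m_n)}{2}\right)$-quantile (which is not
greater than the $\left(\Phi(m_n)\left(\delta - \frac{\theta_n}{\sqrt{n}}\right) + \frac{\Phi(-m_n)}{2}\right)$-quantile) and a
$\left(\Phi(m_n)\left(\delta + \frac{\theta_n}{\sqrt{n}}\right) + \frac{\Phi(-m_n)}{2}\right)$-quantile
for $Z_n$, respectively. Therefore, since $\frac{\Phi(-m_n)}{2\Phi(m_n)} = o\left(\frac{\theta_n}{\sqrt{n}}\right)$, by the following lemma,
$t_n^-$ and $t_n^+$ are also tightly concentrated around a $\delta_n^-$-quantile and a $\delta_n^+$-quantile for $Y_n$,
respectively.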

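Lemma 39. Let $m > 0$. Let $t$ be a $\left(\Phi(m)\mu + \frac{\Phi(-m)}{2}\right)$-quantile of $Z_n$. Then, for some $\mu'$
such that $|\mu' - \mu| \le \frac{\Phi(-m)}{2\Phi(m)}$, a $\mu'$-quantile of $Y_n$, $t_{\mu'}$, satisfies $|t_{\mu'} - t| \le m\sigma_n$.

Proof.

$$\begin{aligned}
\mu + \frac{\Phi(-m)}{2\Phi(m)} &\ge \frac{1}{\Phi(m)} \Pr\{Z_n < t\} = \frac{1}{\Phi(m)} \Pr\{Y_n < t + \epsilon\} \\
&\ge \frac{1}{\Phi(m)} \Pr\{Y_n < t - m\sigma_n\} \Pr\{\epsilon > -m\sigma_n\} = \Pr\{Y_n < t - m\sigma_n\},
\end{aligned}$$

$$\begin{aligned}
\mu + \frac{\Phi(-m)}{2\Phi(m)} &\le \frac{1}{\Phi(m)} \Pr\{Z_n \le t\} = \frac{1}{\Phi(m)} \Pr\{Y_n \le t + \epsilon\} \\
&\le \frac{1}{\Phi(m)} \left[\Pr\{Y_n \le t + m\sigma_n\} \Pr\{\epsilon \le m\sigma_n\} + \Pr\{\epsilon > m\sigma_n\}\right] \\
&\le \frac{1}{\Phi(m)} \left[\Pr\{Y_n \le t + m\sigma_n\}\Phi(m) + \Phi(-m)\right] \\
&= \Pr\{Y_n \le t + m\sigma_n\} + \frac{\Phi(-m)}{\Phi(m)}.
\end{aligned}$$

Therefore, $\mu + \frac{\Phi(-m)}{2\Phi(m)} \ge \Pr\{Y_n < t - m\sigma_n\}$ and $\mu - \frac{\Phi(-m)}{2\Phi(m)} \le \Pr\{Y_n \le t + m\sigma_n\}$. Let
$\Delta_1 = \Pr\{Y_n < t - m\sigma_n\} - \mu$, and let $\Delta_2 = \Pr\{Y_n \le t + m\sigma_n\} - \mu$. Note that $t - m\sigma_n$ is a
$(\mu + \Delta_1)$-quantile and that $t + m\sigma_n$ is a $(\mu + \Delta_2)$-quantile of $Y_n$. Therefore, since $\Delta_2 \ge \Delta_1$,
for every $\Delta \in [\Delta_1, \Delta_2]$, there is some $t'$ such that $|t' - t| \le m\sigma_n$, which is a $(\mu + \Delta)$-quantile
of $Y_n$. To complete the proof, note that $\Delta_1 \le \frac{\Phi(-m)}{2\Phi(m)}$ and $\Delta_2 \ge -\frac{\Phi(-m)}{2\Phi(m)}$. Therefore, there
exists some $\Delta \in \left[-\frac{\Phi(-m)}{2\Phi(m)}, \frac{\Phi(-m)}{2\Phi(m)}\right]$ such that $\Delta \in [\Delta_1, \Delta_2]$. ■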



Theorem 40. The rejection function output by the algorithm is (almost surely) identical
to that of Theorem 37 at the limit, where $U'_n = U_n$.

Proof. By definition, $x \in \mathrm{supp}(u'_n) \Leftrightarrow A_n(x) \in G_P$.

We represent by $t_n^{*-}$ and $t_n^{*+}$ the $\delta_n^-$ and $\delta_n^+$ quantiles of $Y_n^*$, around which (for sufficiently
large $n$), $t_n^-$ and $t_n^+$ are tightly concentrated. In particular, $t_n^{*-} < t_n^{*+} \Rightarrow t_n^- < t_n^+$. Note that
$t_n^- \le t_n^+$ and $t_n^{*-} \le t_n^* \le t_n^{*+}$ always, and at the limit, $t_n^- = t_n^{*-} = t_n^* = t_n^{*+} = t_n^+$. We now
consider four cases.

In the first, $P$ doesn't have a $\delta_n^-$-jump or a $\delta$-jump (see Figure 7.1(a)). Then, $t_n^{*-} < t_n^* \le
t_n^{*+}$. Therefore, for $x \in \mathrm{supp}(u'_n)$, at the limit: $I(h_n(x) \le t_n^-) = I(h_n^*(x) \le t_n^*)$.

In the second, $P$ has a $\delta_n^-$-jump but it doesn't have a $\delta$-jump (see Figure 7.1(b)). Then,
for sufficiently large $n$, $t_n^{*-} = t_n^* < t_n^{*+}$ and $\delta_n^- \le \Pr\{Y_n^* \le t_n^*\} \le \delta$. Therefore, for
$x \in \mathrm{supp}(u'_n)$, at the limit: $I(h_n(x) \le t_n^-) = I(h_n^*(x) \le t_n^*)$.

In the third, $P$ doesn't have a $\delta_n^-$-jump but it does have a $\delta$-jump (see Figure 7.2(a)). Then,
for sufficiently large $n$, $t_n^{*-} < t_n^* = t_n^{*+}$. Therefore, for $x \in \mathrm{supp}(u'_n)$, at the limit:
$I(h_n(x) \le t_n^-) = I(h_n^*(x) < t_n^*)$.

In the fourth, $P$ has both a $\delta_n^-$-jump and a $\delta$-jump (see Figure 7.2(b)). Then, for
sufficiently large $n$, $t_n^{*-} = t_n^* = t_n^{*+}$. Therefore, for $x \in \mathrm{supp}(u'_n)$, at the limit: $I(h_n(x) < t_n^-) =
I(h_n^*(x) < t_n^*)$. ■



Remark 41 (Rates of Convergence and Finite Sample Notes). The time complexity
for our algorithm is $O(C(n) + n)$, where $C(n)$ is the time complexity for the soft-classification
algorithm. The rate of convergence for the given algorithm is $\frac{\theta_n}{\sqrt{n}} = O\left(n^{-\frac{1}{2}+\epsilon}\right)$, for any
$\epsilon > 0$, in addition to the classifier's rate of convergence.$^9$ $\theta_n$ is only affected by the
quantile-estimator used. In our case, the quantile estimator utilized only requires that $F$, the cdf
whose quantile is being estimated, be continuous. To meet this condition we added the
noise term $\epsilon$. Note that $\epsilon$ has no effect on the convergence rate; this is because $\sigma_n$ can
vanish as fast as desired. Similarly, by Lemma 39, we can achieve arbitrarily tight bounds
on the nearness of the quantiles of $Z_n$ and $Y_n$ by increasing the rate at which $m_n$ tends to
infinity.

9. The classifier doesn't truly need to minimize the loss. Depending on the quantile-estimator, it is possible
that only classifier errors which result in "ordering violations" across the $\delta$-quantile can affect the output
(beyond whether a strong or weak inequality is used for testing the threshold). Thus, faster rates than
the classifier's convergence rate to the minimum may be possible. Also, ranking algorithms (see, e.g.,
Clémençon et al., 2005) could be used instead of soft-classification. In this case, achievable error rates
could provide (loose) upper bounds on such ordering violations.

For finite sample sizes, some additional modifications are advisable. First, in order
to ensure that $h_n^*(\cdot) = E[h_n(\cdot)]$, $O_n$ should be of size $n' \sim NB(n, \frac{1}{2})$, and not $n$. It
is also possible to use a non-uniform prior probability (without affecting the algorithm's
correctness), if it is desired. A validation set could be used for determining the
quantile-estimates, rather than the training set. Note that for finite samples, it is not guaranteed
that $\Pr\{Y_n \le t_n^-\} \approx \delta_n^-$. In fact, it is possible for it to be significantly larger if $Y_n$ has a large jump
within $m_n\sigma_n$ of $t_n^-$. Since by Lemma 39, $\Pr\{Y_n < t_n^- - m_n\sigma_n\} \le \delta_n^- + \frac{\Phi(-m_n)}{2\Phi(m_n)} + o\left(\frac{\theta_n}{\sqrt{n}}\right)$,
almost surely, we can address this issue by refining the definition of the rejection function
output by the algorithm:

$$r_n(x) = \begin{cases} 1 & A_n(x) \notin G_P; \\ I(h_n(x) \le t_n^- - m_n\sigma_n) & A_n(x) \in G_P \text{ and } t_n^- < t_n^+; \\ I(h_n(x) < t_n^- - m_n\sigma_n) & \text{otherwise.} \end{cases}$$

Note that this fix isn't possible when using a ranking algorithm in place of a soft binary
classifier, since only points, and not values, can be compared (i.e., $x$ and the chosen quantile
point in $S_Z$ are compared in order to determine whether to reject $x$).

Finally, one needs to determine $\delta_n^-$, so that it is guaranteed (with high probability) that
$\rho(r_n, P) \le \delta$. To accomplish this, one must take into account the quantile estimator used,
since $\delta_n^- \le \delta - \frac{\theta_n}{\sqrt{n}}$, and $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n))$, since this is always rejected. It is known (Hall
& Hannan, 1988) for the histogram density estimator, upon which the sampling of $O_n$ in
the algorithm is loosely based, that $g_n$ of order $n^{-\frac{1}{d+2}}$ is optimal for minimizing $L_b$ distance
for $1 \le b < \infty$, and that $g_n$ of order $\left(\frac{\log n}{n}\right)^{\frac{1}{d+2}}$ is the correct order for minimizing $L_\infty$
distance. However, we are only interested in $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n))$. We note that this is just
the missing mass. Let $n_1$ be the number of grid cells containing exactly one point from
the sample. Then, as shown by Robert and Schapire (2000), with probability at least
$1 - \eta$, $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n)) \le \frac{n_1}{n} + (2\sqrt{2} + \sqrt{3})\sqrt{\frac{\ln(3/\eta)}{n}}$. Clearly, increasing $g_n$ results in $n_1$
decreasing. Therefore, $g_n$ should be large in order to minimize $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n))$ and small
in order to minimize $U_n(\mathbb{R}^d \setminus \mathrm{supp}(p))$ (since if $g_n$ vanishes faster, $\lambda(\mathrm{supp}(u_n) \setminus \mathrm{supp}(p))$
decreases faster as well). This results in a simple heuristic, namely to set $g_n$ to the smallest
value such that $n_1 \le t$, for some threshold $t$. For example, if we know the sample is "clean"
in the sense that all points are drawn i.i.d. according to $P$, then we can take $t = 0$. A
larger value of $t$ could be chosen were we to suspect that the sample may contain noise, for
example $t = \log n$. In general, it remains an open question how $g_n$ should be optimized
to balance between $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n))$ and $U_n(\mathbb{R}^d \setminus \mathrm{supp}(p))$.
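
The grid-size heuristic can be sketched as a scan over candidate cell sizes. This is an illustration only: the candidate list, the function names, and the choice to scan a finite set of sizes are assumptions, and the bound helper assumes the missing-mass bound takes the Good-Turing form $\frac{n_1}{n} + (2\sqrt{2} + \sqrt{3})\sqrt{\ln(3/\eta)/n}$.

```python
import math
from collections import Counter

def singleton_cells(points, g, origin):
    """n_1: number of grid cells (side g) containing exactly one sample point."""
    counts = Counter(
        tuple(math.floor((x - o) / g) for x, o in zip(p, origin)) for p in points
    )
    return sum(1 for c in counts.values() if c == 1)

def smallest_grid_size(points, candidates, origin, t=0):
    """Smallest candidate cell size whose singleton-cell count is at most t."""
    for g in sorted(candidates):
        if singleton_cells(points, g, origin) <= t:
            return g
    return None   # no candidate satisfies the threshold

def missing_mass_bound(n1, n, eta):
    """High-probability upper bound on the mass outside the occupied cells."""
    return n1 / n + (2 * math.sqrt(2) + math.sqrt(3)) * math.sqrt(math.log(3 / eta) / n)
```

Since $n_1$ need not decrease monotonically as the cell size grows, the scan simply returns the first admissible candidate in increasing order rather than relying on a bisection search.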

Remark 42. Cuevas and Fraiman (1997) use a plug-in approach to support estimation that
can be leveraged here to further decrease $U_n(\mathbb{R}^d \setminus \mathrm{supp}(p))$ when $p$ has compact support
and is continuously differentiable. Let $g_n = c\,n^{-\frac{1}{d+2}}$ for some constant $c$, and let $\alpha_n$ be
such that = o{g~^). For example, an=^/g^, or if d = o(^j^^^^^, then an='^^- 
Then, let $G_P$ only contain the "lower-left" corners of grid cells containing more than $n\alpha_n$
sample points. Since this only decreases $U_n(\mathbb{R}^d \setminus \mathrm{supp}(p))$, Lemma 32 remains correct and
$U_n(\mathbb{R}^d \setminus \mathrm{supp}(p)) \xrightarrow{a.s.} 0$. Furthermore, $\lambda(\mathrm{supp}(p) \,\triangle\, \mathrm{supp}(u_n)) \to 0$ (Cuevas & Fraiman,
1997), and thus $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n)) \to 0$, as well. One may use results given by Robert and
Schapire (2000) to obtain an upper bound on $P(\mathbb{R}^d \setminus \mathrm{supp}(u_n))$ for finite sample sizes.

Remark 43. It may be possible to improve on the convergence rate for the
quantile-estimator, by using more information than 1 to 2 order statistics. This carries with it
the risk of being less robust to classifier error. One such method is kernel-based quantile
regression (Christmann & Steinwart, 2008), which is provably consistent. More complex
quantile estimation methods may be useful in improving the convergence rate, without
affecting the overall time complexity (dependent on the soft-classifier's time complexity),
but these may exclude the use of ranking algorithms, as the quantile estimation method
may rely on more than the relative ordering of the sample points.

7.4 Discussion 

We have provided a computationally simple and consistent procedure for determining a 
(5-maximal level set estimator, which for measures that don't have a (5-jump, is also a 6- 
tight LDRF. While we have generated a uniform distribution for identifying low-density 
areas of P, this is not strictly necessary. Indeed, to return to the investment analogy, it 
is only necessary that the low-density areas have greater ROI than the high density areas. 
We term distributions which meet this condition as lenient adversarial distributions. The 
soft-classification approach used in this section applies for any such lenient adversarial dis- 
tribution. Indeed, lenient adversarial distributions can also be used when the underlying 
mechanism is a hard-classifier. See (Nisenson, 2010) for a full discussion on lenient adver- 
sarial distributions and their relation to the existing SCC literature. The importance of 
these results, including the generation of a "tight" lenient adversarial distribution as given 
by the algorithm, lies not only in their justification of various approaches in the literature, 
but also in their applicability. Their only requirements are on the loss function used and 
that P be absolutely continuous with respect to the Lebesgue measure. Since most common 
loss functions satisfy the requirements and the condition on P is quite weak, a large body 
of results for regression and two-class classification can be utilized. 

8. On The Dual SCC Problem

In the dual SCC problem the learner would like to guarantee the type II error, and minimize 
the type I error. This problem can be relevant to intrusion detection and authentication 
applications as well as to data mining and novelty detection. For example, in a biometric 
passport authentication system the authorities may mandate a maximal intruder pass rate. 
Under this constraint one would clearly want to minimize the false alarm rate. An alter- 
native example is spam detection. A user may already have a two-class classification spam 
detection system in place. This system may perform very well at detecting spam which is 
similar to previously encountered spam. However, spammers are continually updating their 
spam so it will evade these filters. A second level system could be created, where an SCC 
classifier is trained on the legitimate e-mails. Any e-mails which the first-level determines 
as legitimate would then be tested against the second-level SCC classifier, which would 
either accept or reject them. A user may be willing to tolerate a certain level of spam from 
this second-level system, such as 1 in every 100 messages belonging to a new spam class 
getting through, but given that rate, would like as few legitimate messages as possible to 
be rejected. 

Let $\delta_Q$ be the maximally allowed type II error. Then the dual SCC problem is:

\[ \operatorname*{argmin}_{r} \ \rho(r, P) \quad \text{such that: } \rho(r, Q) \ge 1 - \delta_Q, \ \forall Q \in \mathcal{Q}, \]

where $r(\cdot)$ is any function $\Omega \to [0,1]$. When $\Omega$ is discrete and finite, this problem has a
finite number of variables and a possibly infinite number of constraints depending on $\mathcal{Q}$.
Thus, it is a linear semi-infinite program.

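We represent by $r^*_{I}(\cdot)$ a solution to the primal problem, and by $r^*_{II}(\cdot)$ a solution to
the dual problem. Define $\delta^* = \rho(r^*_{II}, P)$ and $\delta^*_Q = 1 - \min_{Q \in \mathcal{Q}} \rho(r^*_{I}, Q)$. Since the constant
functions $r(\omega) = \delta$ and $r(\omega) = 1 - \delta_Q$ are respectively feasible solutions to the primal and dual
problems, $\delta^*_Q \le 1 - \delta$ and $\delta^* \le 1 - \delta_Q$.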
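
As a small illustration of the feasibility remark above, the following Python sketch evaluates rejection rates on a toy two-event space and checks the dual constraint for a constant rejection function. The distributions `P` and `QS` and the helper names `rho` and `dual_feasible` are assumptions made for illustration; the finite list `QS` merely stands in for the family of opposing distributions.

```python
from fractions import Fraction as F

def rho(r, D):
    """Rejection rate of the soft rejection function r under distribution D."""
    return sum(ri * di for ri, di in zip(r, D))

def dual_feasible(r, Qs, delta_q):
    """True if every listed opposing distribution is rejected w.p. >= 1 - delta_q."""
    return all(rho(r, Q) >= 1 - delta_q for Q in Qs)

# Hypothetical toy data (not from the paper).
P = (F(1, 5), F(4, 5))
QS = [(F(7, 10), F(3, 10)), (F(1, 2), F(1, 2))]
delta_q = F(1, 100)  # e.g. at most 1 in 100 intruders may pass

# A constant rejection function rejects every distribution at the same rate.
constant = (1 - delta_q,) * len(P)
half = (F(1, 2), F(1, 2))

print(dual_feasible(constant, QS, delta_q), rho(constant, P))
print(dual_feasible(half, QS, delta_q))
```

The constant function is always feasible but pays a type I error of $1 - \delta_Q$, which is why it only serves as a trivial upper bound on the optimal dual value.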

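Lemma 44. Let $\Omega$ be finite and discrete. If $\delta^*_Q > 0$, then $\rho(r^*_{I}, P) = \delta$. If $\delta^* > 0$, then
$\min_Q \rho(r^*_{II}, Q) = 1 - \delta_Q$.

Proof Let $\delta^*_Q > 0$. Let us assume by contradiction that $\rho(r^*_{I}, P) < \delta$. Then, define
$r'(\omega) = \min\{1, r^*_{I}(\omega) + \delta - \rho(r^*_{I}, P)\}$. Clearly, $\rho(r', P) \le \delta$ and $\min_Q \rho(r', Q) > \min_Q \rho(r^*_{I}, Q)$.
Contradiction.

Let $\delta^* > 0$. Let us assume by contradiction that $\min_Q \rho(r^*_{II}, Q) > 1 - \delta_Q$. Then,
define $r''(\omega) = \max\{0, r^*_{II}(\omega) - (\min_Q \rho(r^*_{II}, Q) - (1 - \delta_Q))\}$. Clearly, $\min_Q \rho(r'', Q) \ge 1 - \delta_Q$ and
$\rho(r'', P) < \rho(r^*_{II}, P)$. Contradiction. ■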

We define $\mathcal{R}^*_{I} = \mathcal{R}^*_\delta$ as the set of primal-optimal rejection functions, and $\mathcal{R}^*_{II}$ as the set
of dual-optimal rejection functions. Examining the dual SCC problem in the investment
analogy, the learner is assigned a target amount of money, $1 - \delta_Q$, which must be obtained on
selling all assets. The learner's goal is to achieve this with the minimal starting investment.
We can see that if the learner invests no money, then the amount of money made will fall
short of the target. By investing in assets with higher ROI, the learner makes the most
progress towards the target with the least amount of money invested. Thus, we
can see that the optimal investment strategy is likely to be similar to that of the primal 
problem. In fact, as shown by the following theorem, under mild conditions, the two sets 
of optimal strategies are identical. 

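Theorem 45 (Primal-Dual Equivalence). Let $\Omega$ be finite and discrete. If $\delta > 0$ and
$\delta^*_Q > 0$, then $\mathcal{R}^*_{I} = \mathcal{R}^*_{II}$ when the dual problem is solved with $\delta_Q = \delta^*_Q$. If $\delta_Q > 0$ and $\delta^* > 0$, then $\mathcal{R}^*_{II} = \mathcal{R}^*_{I}$ when the primal problem is solved with $\delta = \delta^*$.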

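Proof Let $\delta > 0$ and $\delta^*_Q > 0$. By Lemma 44, $\rho(r^*_{I}, P) = \delta$. Clearly, $r^*_{I}$ is a feasible
solution to the dual problem with $\delta_Q = \delta^*_Q$. Thus, $\delta^* \le \delta$. Let us assume by contradiction
that $\delta^* < \delta$. Then, there must exist some $r^*(\cdot)$ such that $\min_Q \rho(r^*, Q) \ge 1 - \delta^*_Q$ and
$\rho(r^*, P) < \delta$. Define $r'(\omega) = \min\{1, r^*(\omega) + \delta - \rho(r^*, P)\}$. Then, clearly $\rho(r', P) \le \delta$, but
$\min_Q \rho(r', Q) > \min_Q \rho(r^*, Q) \ge 1 - \delta^*_Q = \min_Q \rho(r^*_{I}, Q)$. Contradiction. Therefore, $\delta^* = \delta$.

Since $\delta^* = \delta > 0$, by Lemma 44, $\min_Q \rho(r^*_{II}, Q) = 1 - \delta^*_Q$. Thus, $r \in \mathcal{R}^*_{I}$ if and only if $\min_Q \rho(r, Q) =
1 - \delta^*_Q$ and $\rho(r, P) = \delta$. Likewise, $r \in \mathcal{R}^*_{II}$ if and only if $\min_Q \rho(r, Q) = 1 - \delta^*_Q$ and $\rho(r, P) = \delta$.
Therefore, $\mathcal{R}^*_{I} = \mathcal{R}^*_{II}$.

Let $\delta_Q > 0$ and $\delta^* > 0$. By Lemma 44, $\min_Q \rho(r^*_{II}, Q) = 1 - \delta_Q$. Clearly, $r^*_{II}$ is a
feasible solution to the primal problem with $\delta = \delta^*$. Thus, $\delta^*_Q \le \delta_Q$. Let us assume by
contradiction that $\delta^*_Q < \delta_Q$. Then, there must exist some $r^*(\cdot)$ such that $\min_Q \rho(r^*, Q) >
1 - \delta_Q$ and $\rho(r^*, P) \le \delta^*$. Define $r''(\omega) = \max\{0, r^*(\omega) - (\min_Q \rho(r^*, Q) - (1 - \delta_Q))\}$. Then, clearly
$\min_Q \rho(r'', Q) \ge 1 - \delta_Q$, but $\rho(r'', P) < \rho(r^*, P) \le \delta^* = \rho(r^*_{II}, P)$. Contradiction. Therefore,
$\delta^*_Q = \delta_Q$.

Since $\delta^*_Q = \delta_Q > 0$, by Lemma 44, $\rho(r^*_{I}, P) = \delta = \delta^*$. Thus, $r \in \mathcal{R}^*_{I}$ if and only if $\min_Q \rho(r, Q) =
1 - \delta_Q$ and $\rho(r, P) = \delta^*$. Likewise, $r \in \mathcal{R}^*_{II}$ if and only if $\min_Q \rho(r, Q) = 1 - \delta_Q$ and $\rho(r, P) = \delta^*$.
Therefore, $\mathcal{R}^*_{I} = \mathcal{R}^*_{II}$. ■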
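
The equivalence can be checked numerically on a tiny instance. The sketch below is only an illustration under stated assumptions: the two-event distribution `P`, the finite stand-in `QS` for the adversary's family, and the brute-force grid search over soft rejection functions are all hypothetical choices, not the paper's method. It solves the primal problem for a given $\delta$, feeds the resulting type II error into the dual problem, and compares the two optimal sets.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy instance (exact arithmetic avoids rounding issues).
P = (F(1, 5), F(4, 5))
QS = [(F(7, 10), F(3, 10)), (F(1, 2), F(1, 2))]
GRID = [F(k, 100) for k in range(101)]  # candidate rejection probabilities

def rho(r, D):
    """Rejection rate of r under distribution D."""
    return sum(ri * di for ri, di in zip(r, D))

def worst_case(r):
    """Smallest rejection rate over the listed opposing distributions."""
    return min(rho(r, Q) for Q in QS)

def solve_primal(delta):
    """Maximize the worst-case rejection rate subject to rho(r, P) <= delta."""
    feasible = [r for r in product(GRID, repeat=2) if rho(r, P) <= delta]
    best = max(worst_case(r) for r in feasible)
    return best, {r for r in feasible if worst_case(r) == best}

def solve_dual(delta_q):
    """Minimize rho(r, P) subject to worst-case rejection >= 1 - delta_q."""
    feasible = [r for r in product(GRID, repeat=2) if worst_case(r) >= 1 - delta_q]
    best = min(rho(r, P) for r in feasible)
    return best, {r for r in feasible if rho(r, P) == best}

delta = F(1, 10)
primal_value, primal_set = solve_primal(delta)
delta_q_star = 1 - primal_value          # optimal type II error of the primal
delta_star, dual_set = solve_dual(delta_q_star)

print(primal_value, delta_star, primal_set == dual_set)
```

On this instance both problems are solved by the same rejection function, which places all rejection mass on the low-probability event.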

Using Theorem 45, it is trivial to solve the dual SCC problem where $\mathcal{Q} = \mathcal{Q}_\Lambda = \{Q :
D_P(Q) \ge \Lambda\}$. To begin with, since we assume that $p_i > 0$ for all $i$, note that $\delta_Q < 1 \Rightarrow \delta^* >
0$. Therefore, since $\delta_Q > 0$ the optimal solution sets are identical by Theorem 45, and all the
intermediate results, including Theorem 19, are correct when solving the primal problem
with $\delta = \delta^* > 0$. Therefore, it is trivial to construct a dual-analogue to Theorem 23. We
also prove the analogue to Lemma 24.

Theorem 46 (Dual SCC Linear Program). An optimal soft rejection function and the
optimal type I error, $z_I$, are obtained by solving the following linear program:

\[ \text{minimize}_{r_1, r_2, \ldots, r_K, z_I} \ z_I, \quad \text{subject to:} \]

\[ \sum_{i=1}^{K} r_i |S_i| p(S_i) \le z_I \qquad (3) \]

\[ 1 \ge r_1 \ge r_2 \ge \cdots \ge r_K \ge 0 \]

>1-6q, zG {1,2,...,|£||H| + |M|}. 

Lemma 47. Let $r^*$ be the solution to the linear program. If $r^*$ is vulnerable, then $r^* = r^{1-\delta_Q}$.

Proof Let r* be a vulnerable solution to the linear program (3), which clearly satisfies 
^ ^ r\ > r2 > ■ ■ ■ > r*^ > ^- Therefore, for all i G Imin{i'*)-, r*{i) = r*^. We define z*j to be 
the minimal value of z/ that the linear program achieves for r* . Let j = argmin^gj^^ ,^ 
and let 5"^ be the level set to which j belongs. We now prove that u = 1. 

We first deal with the case where > A (in which case the constraint is completely 
vacuous). We note in this case that Si, ^2, . . . , Sk ^ Cy]M, and therefore w = K. Thus, 
we have r\ > > ■ ■ ■ > r*^^ = r"^ > I - 6q. Therefore, '}2d=i\Si\p{Si)r* > r*^ > I - 6q. 
Therefore, z*j > 1 — 6q. We note that ri = r2 = . . . ri<- = 1 — (5q is a valid solution to the 
linear program for which z/ = 1 — 5g, which is the minimal value achievable. Therefore, 
z* = 1-6q. liu> 1, then > > = 1 - 5q and T.i=l\SMS^)rt > r*^ > I - 5q. 
Therefore, if > A, u = 1. 

We now turn our attention to the case where D'p' < A. If we assume by contradiction 
that u > 1, then r*_;^ > '"u = Ki+i = ■ ■ ■ = r*^- Dp (X'^-'^) > A and by our assumption, 
Dp'^ < A, which implies that Sk ^Ti- HDp (X^-?)) > A, then there exists some I for which 
j £ Su = Si £ C. The rejection rate for the level-set pair, K), is > 1 — 6q. Otherwise, 
Dp (A:(J)) = A, and j G 5„ = S^ G M, and we have a rejection rate for Sm of — 
and since > 1 — 6q, this implies that = r*^ = > 1 — 6q. Therefore, since u > 1, 
we get that Ylf=i\Si\p{Si)r* > r*^^ > 1 - 6q. Therefore, if Dp" <A,u = l. 

Therefore, $u = 1$. This results in $r^*_1 = r^*_2 = \cdots = r^*_K \ge 1 - \delta_Q$. Therefore,
$\sum_{i=1}^{K} |S_i| p(S_i) r^*_i = r^*_K$ is clearly minimized by $r^*_K = 1 - \delta_Q$, or $r^* = r^{1-\delta_Q}$. ■

9. Concluding Remarks

We have introduced a game-theoretic approach to the SCC problem. In this approach the 
learner is opposed by an adversary. We believe that this viewpoint is essential for analyzing 
SCC applications such as intrusion detection and, in general, for "agnostic" analysis of 
single-class classification. This game-theoretic view lends itself well to analysis, allowing 
us to prove under what conditions low-density rejection is hard-optimal and if an optimal 
monotone rejection function is guaranteed to exist. Our analysis introduces soft decision 
strategies, which potentially allow for significantly better performance in our adversarial 
setting. 

Observing the learner's futility when facing an omniscient and unlimited adversary, 
we considered restricted adversaries and provided full analysis of an interesting family of 
constrained games (in a decision-theoretic "Bayesian" setting where P is assumed to be 
known). The constraint we imposed on the adversary, given in terms of a divergence gap 
between the target and opposing distributions, is inspired by similar constraints used in 
"two-sample problem" related work in information theory (Ziv, 1988; Gutman, 1989; Ziv 
& Merhav, 1993). Of course, to compute the optimal learner strategy one has to know the 
exact value of this divergence gap, which is unknown in pure SCC problems. In applications 
we expect that something will be known or could be hypothesized about possible opposing 
distributions. For example, in biometric authentication, one should be able to statistically 
measure this gap. Thus, one could perhaps determine with a high confidence level that at 
least 99.9% of the population has a distribution with a KL-Divergence of at least 10 from the 
distribution of any member of the population. This can obviously be extended to k-factor
authentication (see, e.g., Pointcheval & Zimmer, 2008). Assuming that the adversary may 
know k — 1 factors, gaps can be found for each of the factors in order to ensure a particular 
intruder pass rate, with high probability. A different type of example occurs in extremely 
unbalanced two-class classification problems. Here, one could utilize the very few given 
examples from the other class to infer a bound on the gap. This complements the results 
of Kowalczyk and Raskutti (2003) where one-class learners were found to out-perform their 
two-class counterparts in some settings. 

Our final major contribution is a simple and computationally feasible one-class classifi- 
cation algorithm. The SCC classifier is generated by thresholding a soft two-class classifier's 
output, where the output serves as a proxy for a density estimate, and a quantile estimate 
serves as the threshold. This approach can be extended to other use cases. For example, in 
(Yeh, Lee, & Lee, 2009), a multi-class classification problem is solved by constructing SVDD
(Tax & Duin, 1999) one-class classifiers, where each class is described by a sphere, and
learning a discriminant function which assigns a test point the class whose sphere center-point
it is "nearest" to (the distance is normalized by various statistics). Instead of using
SVDD, our approach would be to create a two-class classifier for each class, where the
second class is uniformly distributed over the active cells. A test point would be passed 
to each two-class classifier and the class chosen would be that belonging to the classifier 
which ranked the test point in the highest quantile (relative to the training sample for each 
class). This classification scheme makes sense because it labels the test point with the class 
for which it has the highest "relative" density (relative to other points within each class). 
Thus, we achieve the same goal without resorting to heuristics. 
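
The multi-class scheme described above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the per-class scoring functions in `scorers` are hypothetical stand-ins for the outputs of trained soft two-class classifiers, and `train_points` is made-up data; neither comes from the paper.

```python
from bisect import bisect_right

def empirical_quantile(sorted_scores, score):
    """Fraction of the class's training scores that are <= the given score."""
    return bisect_right(sorted_scores, score) / len(sorted_scores)

def classify_by_quantile(x, scorers, train_points):
    """Label x with the class whose scorer ranks it in the highest quantile."""
    best_label, best_q = None, -1.0
    for label, scorer in scorers.items():
        reference = sorted(scorer(t) for t in train_points[label])
        q = empirical_quantile(reference, scorer(x))
        if q > best_q:
            best_label, best_q = label, q
    return best_label

# Hypothetical soft scores: larger means "more typical" of the class.
scorers = {
    "a": lambda x: -abs(x - 0.0),
    "b": lambda x: -abs(x - 5.0),
}
train_points = {
    "a": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "b": [4.0, 4.5, 5.0, 5.5, 6.0],
}

print(classify_by_quantile(0.4, scorers, train_points))
print(classify_by_quantile(5.2, scorers, train_points))
```

Because only the rank of the test score within each class's training scores is used, the scorers need not be calibrated against one another.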

We have introduced a dual SCC problem and shown that, under very weak conditions, 
the solution sets for the primal and dual problems coincide. This allows one to easily extend 
results from one setting to the other, as we demonstrated by providing the dual solution to 
the constrained family of games considered earlier. 

Various extensions and generalizations to these results can be found in (Nisenson, 2010).
These include extensions of Section 5 results to the infinite discrete setting and extensions 
to Section 7 giving additional results in the continuous setting such as a two-class reduction 
of SCC to hard binary-classification (as opposed to soft classification as we present here). 

Our work can be extended in various ways and we believe that it opens up new av- 
enues for future research and in particular could be useful for inspiring new algorithms for 
finite-sample SCC problems. One of the most important questions would be to determine 
convergence rates for the algorithm given in Section 7.3. It would be very nice to obtain 
an explicit expression for the lower bound output by the linear program of Theorem 23. 
Extensions of the analysis and algorithms for additional feature spaces, such as graphs or 
time-series, would be useful. An interesting question is whether performance, whether in 
terms of type II error or convergence rates, could be improved in different spaces. Clearly, 
the utilization of randomized strategies should be carried over to the finite sample case 
as well. A natural desirable extension is to extend our analysis for the soft setting to 
continuously infinite spaces. 

We have focused in this work on "single-shot" games, meaning that the learner has to 
make a decision after every test observation. This is a very difficult setting as one cannot 
utilize cumulative statistics of the other class. Thus, we would expect that the results could 
be improved upon in a repeated-game setting, where several observations are provided from 
the same distribution, or in change point detection (Page, 1954; Hinkley, 1970), where one 
has to determine in a series of observations where the distribution P has been replaced 
by the (unknown) distribution Q as the underlying source. In the finite discrete setting, 
one should be able to easily extend some of the results here by replacing events with types 
(Cover & Thomas, 1991). Finally, a very interesting setting to consider is one in which the
adversary has partial knowledge of the problem parameters and the learner's strategy. For 
example, the adversary may only know that P is in some subspace. 

Appendix A. Section 5 Proofs

As a reminder, we assume that $p_i > 0$ for all $i \in \Omega$. Furthermore, for convenience we assume
w.l.o.g. that $\Omega$ is defined such that $0 < p_1 \le p_2 \le \cdots \le p_N$.

Lemma 48. Let $a + b = c + d$ and $a + c \ge b + d$. Then, $a \ge d$.

Proof Clearly $\frac{a+c}{2} \ge \frac{a+b+c+d}{4}$. Therefore, $a \ge \frac{a+b+c+d}{2} - c = \frac{2(c+d)}{2} - c = d$. ■

Theorem 4 (Optimal Monotone Hard Decisions). When the learner is restricted to
hard-decisions and $\mathcal{Q}$ satisfies Property A w.r.t. $P$, then there exists a monotone $r \in \mathcal{R}^*_\delta$.

Proof Recalling that $0 < p_1 \le p_2 \le \cdots \le p_N$, we now define a rejection function as being
$x$-monotone, if it is monotone up to index $x$. In other words, a rejection function $r(\cdot)$ is
$x$-monotone if $p_j < p_k \Rightarrow r(j) \ge r(k)$, for all $j < k \le x$. Clearly, all rejection functions are
1-monotone, and a monotone rejection function is $N$-monotone.

Let us assume, by contradiction, that no monotone rejection function exists in $\mathcal{R}^*_\delta$. We
will prove the existence of an $N$-monotone rejection function in $\mathcal{R}^*_\delta$ via induction. Let
$r \in \mathcal{R}^*_\delta$. Then, $r$ is $(k-1)$-monotone but not $k$-monotone, for some $2 \le k \le N$. Let
$j = \min\{i : r(i) = 0\}$. We note that $1 \le j < k$ and $r(k) = 1$ (otherwise, $r$ would be
$k$-monotone). We now prove the existence of a $k$-monotone rejection function, $r^* \in \mathcal{R}^*_\delta$. We
define $r^*$ as follows:

\[ r^*(i) = \begin{cases} 1 & i = j, \\ 0 & i = k, \\ r(i) & \text{otherwise.} \end{cases} \]

Note that for all $i \le j$, $r^*(i) = 1$, and for all $j < i \le k$, $r^*(i) = 0$. Thus, $r^*$
is a $k$-monotone rejection function. We now prove that $r^* \in \mathcal{R}^*_\delta$. Note that $\rho(r^*, P) =
\rho(r, P) + p_j - p_k \le \rho(r, P) \le \delta$, and thus $r^*$ is a $\delta$-valid hard rejection function. Let
$Q^* \in \mathcal{Q}$ be such that $\min_Q \rho(r^*, Q) = \rho(r^*, Q^*) = \rho(r, Q^*) + q^*_j - q^*_k$. Thus, if $q^*_j \ge q^*_k$,
$\rho(r^*, Q^*) \ge \rho(r, Q^*)$. Otherwise, there exists $Q^{*\prime}$ as in Property A and in particular, by
Lemma 48, $q^*_j \ge q^{*\prime}_k$. Consequently, $\rho(r^*, Q^*) = \rho(r, Q^{*\prime}) + q^*_j - q^{*\prime}_k \ge \rho(r, Q^{*\prime})$. Therefore,
there always exists $Q \in \mathcal{Q}$ such that $\rho(r^*, Q^*) \ge \rho(r, Q)$ (either $Q = Q^*$ or $Q = Q^{*\prime}$).
Therefore, $\min_Q \rho(r^*, Q) \ge \min_Q \rho(r, Q)$, and thus, $r^* \in \mathcal{R}^*_\delta$. Therefore, by induction, there
must exist an optimal $N$-monotone rejection function. Contradiction. ■
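
The swap used in the induction step can be traced with a short Python sketch. It is a minimal illustration, assuming events are indexed in order of non-decreasing probability; the function name `make_monotone` and the toy data are hypothetical. The sketch only demonstrates that repeated swaps yield a monotone hard rejection function whose type I error does not increase; it does not check Property A or optimality against an adversary.

```python
from fractions import Fraction as F

def rho(r, D):
    """Rejection rate of the hard rejection function r under distribution D."""
    return sum(ri * di for ri, di in zip(r, D))

def make_monotone(p, r):
    """Repeat the swap r*(j) = 1, r*(k) = 0 until r is monotone.

    p must be sorted in non-decreasing order; r is a list of 0/1 decisions.
    """
    r = list(r)
    while True:
        zeros = [i for i, ri in enumerate(r) if ri == 0]
        if not zeros:
            return r
        j = zeros[0]  # j = min{i : r(i) = 0}
        # Smallest k > j that breaks monotonicity: rejected but more probable.
        k = next((i for i in range(j + 1, len(r))
                  if r[i] == 1 and p[j] < p[i]), None)
        if k is None:
            return r  # no violation left, so r is monotone
        r[j], r[k] = 1, 0

# Hypothetical example.
p = [F(1, 10), F(1, 5), F(3, 10), F(2, 5)]
r = [1, 0, 0, 1]
r_mono = make_monotone(p, r)
print(r_mono, rho(r, p), rho(r_mono, p))
```

Each swap moves a rejection to a strictly smaller index, so the loop terminates, and since $p_j \le p_k$ the rejection rate under $P$ never grows.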

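Remark 49. The above proof works for a weaker version of Property A: If for all $p_j < p_k$
and $Q \in \mathcal{Q}$ for which $q_j < q_k$, there exists a distribution $Q' \in \mathcal{Q}$ such that
$q'_j - q'_k + \sum_{i=1}^{j} (q_i - q'_i) + \sum_{i=k+1}^{N} \min\{q_i - q'_i, 0\} \ge 0$. As used in the proof, this would read:

\[
\begin{aligned}
\rho(r^*, Q^*) &= \rho(r^*, Q^{*\prime}) + \sum_{i=1}^{N} r^*(i)\,(q^*_i - q^{*\prime}_i) \\
&= \rho(r, Q^{*\prime}) + q^{*\prime}_j - q^{*\prime}_k + \sum_{i=1}^{j} (q^*_i - q^{*\prime}_i) + \sum_{i=k+1}^{N} r(i)\,(q^*_i - q^{*\prime}_i) \\
&\ge \rho(r, Q^{*\prime}) + q^{*\prime}_j - q^{*\prime}_k + \sum_{i=1}^{j} (q^*_i - q^{*\prime}_i) + \sum_{i=k+1}^{N} \min\{q^*_i - q^{*\prime}_i, 0\} \\
&\ge \rho(r, Q^{*\prime}).
\end{aligned}
\]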



Remark 50. If we strengthen the condition in Property A from $q_j + q'_j \ge q_k + q'_k$ to
$q_j + q'_j > q_k + q'_k$ for all distributions $Q$ such that $q_j \le q_k$ (instead of $q_j < q_k$), then all
optimal rejection functions would be monotone. Note that the set of all distributions does
not have this modified property, but the set of all distributions bounded away from zero
($\{Q : q_i > 0, \forall i \in \Omega\}$) does.

Theorem 6 (Optimal Monotone Soft Decisions).
If $\mathcal{Q}$ satisfies Property B w.r.t. $P$, then there exists an optimal strictly monotone rejection
function.

Proof We note that the condition for strict-monotonicity is equivalent to $p_j \le p_k \Rightarrow
r(j) \ge r(k)$, and that $0 < p_1 \le p_2 \le \cdots \le p_N$. We now define an $x$-right-strictly-monotone
rejection function as one which has strictly-monotone properties for the last $x$ indices. In
other words, a rejection function $r(\cdot)$ is $x$-right-strictly-monotone if $p_j \le p_k \Rightarrow r(j) \ge r(k)$,
for all $j < k$, $k > N - x$. Clearly, all rejection functions are 0-right-strictly-monotone, and
an $N$-right-strictly-monotone rejection function is strictly monotone.

We assume contradictorily that there is no such rejection function. Let $r \in \mathcal{R}^*_\delta$. We
note that $r$ is $(v-1)$-right-strictly-monotone but not $v$-right-strictly-monotone for some
$1 \le v \le N$. We will prove by induction that there exists an $N$-right-strictly-monotone
function in $\mathcal{R}^*_\delta$. Let $k = N - v + 1$. Since $r$ is not $v$-right-strictly-monotone, then there
must exist some $j < k$ for which $p_j \le p_k$ and $r(j) < r(k)$. Define, for any event $\omega$ and
distribution $D$:



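\[ S_r(\omega) = \{ i \in \Omega : p_i = p_\omega \wedge r(i) = r(\omega) \}, \qquad g(D, \omega) = \frac{1}{|S_r(\omega)|} \sum_{i \in S_r(\omega)} \frac{d_i}{p_\omega}. \]

$S_r(\omega)$ is the intersection of $\omega$'s probability level-set with $\omega$'s rejection level-set. $g(D, \omega)$ is
simply an average of the elements of $D$ corresponding to symbols in $S_r(\omega)$ normalized by
$p_\omega$. We note that $g(P, \omega) = 1$ always. We define $r^*$ as follows:

\[ r^*(i) = \begin{cases} \dfrac{|S_r(j)|\, p_j\, r(j) + |S_r(k)|\, p_k\, r(k)}{|S_r(j)|\, p_j + |S_r(k)|\, p_k} & i \in S_r(j) \cup S_r(k), \\[2ex] r(i) & \text{otherwise.} \end{cases} \]

It follows that

\[ r^*(j) - r(j) = \frac{|S_r(k)|\, p_k\, (r(k) - r(j))}{|S_r(j)|\, p_j + |S_r(k)|\, p_k} > 0, \]

\[ r(k) - r^*(k) = \frac{|S_r(j)|\, p_j\, (r(k) - r(j))}{|S_r(j)|\, p_j + |S_r(k)|\, p_k} = \frac{|S_r(j)|\, p_j}{|S_r(k)|\, p_k}\, (r^*(j) - r(j)), \]

and, for all $D$,

\[
\begin{aligned}
\rho(r^*, D) - \rho(r, D) &= (r^*(j) - r(j)) \sum_{i \in S_r(j)} d_i + (r^*(k) - r(k)) \sum_{i \in S_r(k)} d_i \\
&= (r^*(j) - r(j)) \left[ \sum_{i \in S_r(j)} d_i - \frac{|S_r(j)|\, p_j}{|S_r(k)|\, p_k} \sum_{i \in S_r(k)} d_i \right] \\
&= (r^*(j) - r(j))\, |S_r(j)|\, p_j \left[ \frac{\sum_{i \in S_r(j)} d_i}{|S_r(j)|\, p_j} - \frac{\sum_{i \in S_r(k)} d_i}{|S_r(k)|\, p_k} \right] \\
&= (r^*(j) - r(j))\, |S_r(j)|\, p_j\, [g(D, j) - g(D, k)].
\end{aligned}
\]

Therefore, noting that $r^*(j) > r(j)$,

\[ \rho(r^*, D) < \rho(r, D) \Rightarrow g(D, j) < g(D, k) \Rightarrow \min_{i \in S_r(j)} \frac{d_i}{p_j} < \max_{i \in S_r(k)} \frac{d_i}{p_k}. \qquad (4) \]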


Since $g(P, j) = g(P, k) = 1$, $\rho(r^*, P) = \rho(r, P) = \delta$. Therefore, $r^*$ is a valid rejection
function. Let $u > k$. We note by the definition of $r^*$ and the fact that $r$ is $(v-1)$-right-strictly-monotone
that $r^*(u) = r(u) \le r(j) < r^*(j) = r^*(k) < r(k)$. Therefore, $r^*$ is still
$(v-1)$-right-strictly-monotone (but not necessarily $v$-right-strictly-monotone).

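Let $Q^*$ be such that $\rho(r^*, Q^*) = \min_Q \rho(r^*, Q)$. We will now show that $\exists Q \in \mathcal{Q}$ s.t.
$\rho(r^*, Q^*) \ge \rho(r, Q)$ (and therefore, $\min_Q \rho(r^*, Q) \ge \min_Q \rho(r, Q)$). The following algorithm
finds such a $Q$:

1. Set $\tilde{Q} = Q^*$.

2. while $\rho(r^*, \tilde{Q}) < \rho(r, \tilde{Q})$

(a) Let $a$ and $b$ be such that $\tilde{q}_a = \min_{i \in S_r(j)} \tilde{q}_i$ and $\tilde{q}_b = \max_{i \in S_r(k)} \tilde{q}_i$. We note that
$\rho(r^*, \tilde{Q}) < \rho(r, \tilde{Q}) \Rightarrow \frac{\tilde{q}_a}{p_j} < \frac{\tilde{q}_b}{p_k} \Rightarrow \frac{\tilde{q}_a}{p_a} < \frac{\tilde{q}_b}{p_b}$.

(b) Since $\mathcal{Q}$ satisfies Property B, there exists a $Q' \in \mathcal{Q}$ which is identical to $\tilde{Q}$ for all
$i \ne a, b$ and such that $\frac{q'_a}{p_a} \ge \frac{q'_b}{p_b}$. Set $\tilde{Q} = Q'$.

3. end while. Output $Q = \tilde{Q}$.

Since for all iterations, $r^*(a) = r^*(b)$, at step (b) we have $\rho(r^*, Q') = \rho(r^*, \tilde{Q}) = \rho(r^*, Q^*)$.
After setting $\tilde{Q} = Q'$ at step (b), we have $\frac{\tilde{q}_a}{p_a} \ge \frac{\tilde{q}_b}{p_b}$ and therefore the loop never repeats
for the same pair of symbols $(a, b)$. Therefore, the loop is guaranteed to terminate. After
ending, $\rho(r^*, Q^*) = \rho(r^*, \tilde{Q}) \ge \rho(r, \tilde{Q})$, so $\min_Q \rho(r^*, Q) \ge \min_Q \rho(r, Q)$, and $r^* \in \mathcal{R}^*_\delta$.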

While there still exists a $j$ such that $r^*(j) < r^*(k)$ we relabel $r^*$ as $r$ and repeat the
above procedure (note that it never repeats for the same pair $j, k$). The resulting $r^*$ is
$(v-1)$-right-strictly-monotone as shown above, but since now $j < k \Rightarrow r^*(j) \ge r^*(k)$, $r^*$ is
$v$-right-strictly-monotone.

Thus, by induction there exists an optimal $N$-right-strictly-monotone rejection function,
which is a contradiction. ■
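
The averaging step in the proof, and the identity leading to inequality (4), can be verified numerically. The following Python sketch is an illustration only: the helper names (`level_set`, `g`, `merge`) and the toy distributions are hypothetical, and the definitions of $S_r(\omega)$ and $g(D, \omega)$ are taken from their verbal description in the proof.

```python
from fractions import Fraction as F

def rho(r, D):
    """Rejection rate of the soft rejection function r under distribution D."""
    return sum(ri * di for ri, di in zip(r, D))

def level_set(p, r, w):
    """S_r(w): events sharing both w's probability and w's rejection value."""
    return [i for i in range(len(p)) if p[i] == p[w] and r[i] == r[w]]

def g(p, r, D, w):
    """Average of d_i over S_r(w), normalized by p_w."""
    S = level_set(p, r, w)
    return sum(D[i] for i in S) / (len(S) * p[w])

def merge(p, r, j, k):
    """Replace r on S_r(j) and S_r(k) by their P-weighted average."""
    Sj, Sk = level_set(p, r, j), level_set(p, r, k)
    wj, wk = len(Sj) * p[j], len(Sk) * p[k]
    avg = (wj * r[j] + wk * r[k]) / (wj + wk)
    merged = set(Sj) | set(Sk)
    return [avg if i in merged else r[i] for i in range(len(r))]

# Hypothetical example with p_j <= p_k and r(j) < r(k) for j = 0, k = 2.
p = [F(1, 10), F(1, 10), F(1, 5), F(3, 5)]
r = [F(1, 4), F(1, 4), F(3, 4), F(0)]
D = [F(3, 10), F(1, 10), F(1, 10), F(1, 2)]
j, k = 0, 2

r_star = merge(p, r, j, k)
lhs = rho(r_star, D) - rho(r, D)
rhs = (r_star[j] - r[j]) * len(level_set(p, r, j)) * p[j] * (g(p, r, D, j) - g(p, r, D, k))
print(r_star, rho(r, p), rho(r_star, p), lhs, rhs)
```

The merge leaves the rejection rate under $P$ unchanged, and the change under any other distribution has the sign of $g(D, j) - g(D, k)$.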



Remark 51. Strengthening the conditions in Property B to $\frac{q_j}{p_j} \le \frac{q_k}{p_k}$ and $\frac{q'_j}{p_j} > \frac{q'_k}{p_k}$ would
strengthen Theorem 6 so that all optimal rejection functions are strictly monotone. Once
more, the set of all distributions does not have this modified property, but the set of all
distributions bounded away from zero does.

Theorem 10 (LDRS optimality). Let $r^*$ be an LDRF. Let $r$ be any monotone $\delta$-valid
rejection function. Then

\[ \min_{Q \in \mathcal{Q}} \rho(r^*, Q) \ge \min_{Q \in \mathcal{Q}} \rho(r, Q), \]

for any $\mathcal{Q}$ satisfying Property C. Thus, if $\mathcal{Q}$ possesses both Property A and Property C
w.r.t. $P$, then LDRS is hard-optimal.

Proof We define, for a hard rejection function $r$, $\theta(r) = \min_{\omega : r(\omega) = 0} p_\omega$, $Z_\theta(r) = \{\omega : p_\omega =
\theta(r) \wedge r(\omega) = 1\}$ and $z_\theta(r) = |Z_\theta(r)|$.

Assume, by contradiction, that $\min_{Q \in \mathcal{Q}} \rho(r^*, Q) < \min_{Q \in \mathcal{Q}} \rho(r, Q)$. Let $Q^*$ be the minimizer
of $\rho(r^*, Q)$. Then, $\rho(r^*, Q^*) < \rho(r, Q^*)$. If $\theta(r) > \theta(r^*)$ then, by the definition
of LDRF and by the monotonicity of $r$, $\rho(r, P) > \delta$, which contradicts $r$'s validity. If
$\theta(r) < \theta(r^*)$ then, by $r$'s monotonicity, $r(\omega) = 1 \Rightarrow r^*(\omega) = 1$, and for any distribution
$D$, $\rho(r, D) \le \rho(r^*, D)$, contradicting $\rho(r^*, Q^*) < \rho(r, Q^*)$. Therefore, $\theta(r) = \theta(r^*)$. If
$z_\theta(r) > z_\theta(r^*)$ then $\rho(r, P) > \delta$ since $r^*$ is an LDRF. Otherwise, $z_\theta(r) \le z_\theta(r^*)$, and by
Property C the set $\mathcal{Q}$ contains all distributions identical to $Q^*$ up to a permutation of the
$\theta$-probability events. Therefore, $\min_{Q \in \mathcal{Q}} \rho(r^*, Q) \ge \min_{Q \in \mathcal{Q}} \rho(r, Q)$. Contradiction. ■
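
The quantities $\theta(r)$ and $z_\theta(r)$ used in this proof are easy to compute for a concrete hard rejection function. The Python sketch below is a rough illustration under an explicit assumption: `greedy_ldrf` rejects events in order of increasing probability for as long as the budget $\delta$ allows, which is one plausible reading of low-density rejection and may differ in detail from the paper's formal LDRF definition given earlier. All names and data are hypothetical.

```python
from fractions import Fraction as F

def greedy_ldrf(p, delta):
    """Assumed low-density rejection: reject the least probable events
    while the total rejected mass stays within delta."""
    r, mass = [0] * len(p), 0
    for i in sorted(range(len(p)), key=lambda i: p[i]):
        if mass + p[i] > delta:
            break
        r[i], mass = 1, mass + p[i]
    return r

def theta(p, r):
    """theta(r): smallest probability among the events that are not rejected."""
    return min((p[i] for i in range(len(p)) if r[i] == 0), default=None)

def z_theta(p, r):
    """z_theta(r): number of rejected events whose probability equals theta(r)."""
    t = theta(p, r)
    return sum(1 for i in range(len(p)) if r[i] == 1 and p[i] == t)

# Hypothetical example with a tie at probability 1/5.
p = [F(1, 10), F(1, 5), F(1, 5), F(1, 2)]
r = greedy_ldrf(p, F(3, 10))
print(r, theta(p, r), z_theta(p, r))
```

Here one of the two events at the threshold probability is rejected and the other is not, which is exactly the situation where the permutation argument of Property C is needed.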



Appendix B. Section 6 Proofs 

Lemma 24. Let $r^*$ be the solution to the linear program. If $r^*$ is vulnerable, then $r^* = r^\delta$.

Proof Let $r^*$ be a vulnerable solution to the linear program (2), which clearly satisfies
^ > rl > r2 > • • • > r'l^ > 0. Therefore, for all i G Imin{i'*)-, r*{i) = r*^. We define z* to be 
the maximal value of z that the linear program achieves for r*. Let j = argmin^gj^^ ,^ (.^.-j 
and let Su be the level set to which j belongs. We now prove that u = 1. 

We first deal with the case where D^'^ > A (in which case the constraint is completely 
vacuous). We note in this case that S*!, 52, . . . , Sk ^ CyjM, and therefore w = K. Thus, we 
have rl>r2>--->r*^ = rl^>z*. We note that z* < 5, otherwise 5 = Yl,f=i\^i\p{^iy*i ^ 
r*^ > z* > 6. We note that ri = r2 = . . . rj<- = 5 is a valid solution to the linear program 
for which z = S, which is the maximal value achievable. Therefore, z* = 6. If u > 1, then 
rl>r*j^>z* =6 and J2f=i\Si\p{Si)r* >r*j^>5. Therefore, if D^"" > A, u = L 

We now turn our attention to the case where Dp^ < A. If we assume by contradiction 
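that $u > 1$, then $r^*_{u-1} > r^*_u = r^*_{u+1} = \cdots = r^*_K$. We define $\Pr[S] = |S|\, p(S)$ for a level set
$S$, and $c = \frac{\Pr[S_{u-1}]}{\sum_{i=u}^{K} \Pr[S_i]}$. Let $0 < \epsilon < \frac{r^*_{u-1} - r^*_u}{c + 1}$. We now define a new rejection function $r'$ as
follows:

\[ r'_i = \begin{cases} r^*_i & i < u - 1, \\ r^*_i - \epsilon & i = u - 1, \\ r^*_i + c\epsilon & i \ge u. \end{cases} \]

We note that:

\[ \rho(r', P) = \rho(r^*, P) - \Pr[S_{u-1}]\,\epsilon + \sum_{i=u}^{K} \Pr[S_i]\, c\,\epsilon = \rho(r^*, P) - \Pr[S_{u-1}]\,\epsilon + \Pr[S_{u-1}]\,\epsilon = \rho(r^*, P) = \delta. \]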

Therefore, $r'$ is $\delta$-valid. Let $z'$ be the maximal value of $z$ that the linear program achieves
for r'. Dp > A and by our assumption, Dp'^ < A, which implies that Sk £ Ti. If 

Dp (^X^^^^ > A, then there exists some I for which j £ Su = Si £ C The rejection rate for 
the level-set pair, {l,K), is r|^. Otherwise, Dp (X*^-'^) = A, and j £ Su = Sm £ -M, and we 
have a rejection rate for Sm of = r'^- Since z* cannot be less than r,J^^^ = r|^, we have 
z* = r*j^ in both cases (r|^ > z* > r|^). We note that: 

\[ r'_{u-1} - r'_u = (r^*_{u-1} - \epsilon) - (r^*_u + c\epsilon) = (r^*_{u-1} - r^*_u) - (c + 1)\epsilon > 0. \]

Clearly for i > u, r\_^ = = ''If + ce. Obviously, 1 > r'^ and > > 0. Therefore, 
1 > ''i > ^'2 ^ ■ ■ ■ ^ ''if > 0, and r' is a feasible solution to the linear program. Furthermore, 
z' > f'min ~ ''if > ''If ~ which contradicts the fact that r* maximizes z (and is the 
solution to the linear program). 

Therefore, $u = 1$. This results in $r^*_1 = r^*_2 = \cdots = r^*_K$. Since $\rho(r^*, P) = \delta$, we have that
$\delta = \sum_{i=1}^{K} |S_i|\, p(S_i)\, r^*_i = r^*_K$, or $r^* = r^\delta$. ■
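
The mass-shifting construction of $r'$ in this proof can be checked with exact arithmetic. The Python sketch below is a minimal illustration; the function name `shift_mass`, the 0-based indexing, and the level-set masses are assumptions made for the example. It lowers the rejection value of level set $S_{u-1}$ by $\epsilon$, raises all later level sets by $c\epsilon$, and confirms that the type I error is unchanged while the smallest rejection value increases.

```python
from fractions import Fraction as F

def shift_mass(r, pr, u, eps):
    """Build r' from r.

    r   : rejection value per level set, non-increasing
    pr  : Pr[S_i] = |S_i| p(S_i) per level set
    u   : 0-based index of the first level set of the constant tail
    eps : perturbation, must satisfy 0 < eps < (r[u-1] - r[u]) / (c + 1)
    """
    c = pr[u - 1] / sum(pr[u:])
    if not 0 < eps < (r[u - 1] - r[u]) / (c + 1):
        raise ValueError("eps outside the admissible range")
    return [ri - eps if i == u - 1 else (ri + c * eps if i >= u else ri)
            for i, ri in enumerate(r)]

def rate(r, pr):
    """Rejection rate under P when r is constant on each level set."""
    return sum(ri * pi for ri, pi in zip(r, pr))

# Hypothetical level-set masses and a vulnerable-looking rejection function.
pr = [F(1, 5), F(3, 10), F(1, 2)]
r = [F(3, 5), F(1, 5), F(1, 5)]
r_new = shift_mass(r, pr, u=1, eps=F(1, 10))
print(r_new, rate(r, pr), rate(r_new, pr))
```

The upper bound on $\epsilon$ is what keeps $r'$ ordered, so that it remains feasible for the monotonicity constraints of the linear program.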

Appendix C. Section 7 Proofs

We begin by providing some additional definitions. Let $\mathbb{B}$ be the set of all Borel sets over
$\mathbb{R}^d$. For two Borel sets $a, b$ we define $a = b \Leftrightarrow \lambda(a \,\triangle\, b) = 0$, where $\triangle$ is the symmetric
difference operator. For two functions $f, g$ over $\mathbb{R}^d$ and Borel set $b$, define $\Delta_b(f, g) = \{x \in
b : f(x) \ne g(x)\}$. We define the function $I_b(x) = I(x \in b)$, where $I(\cdot)$ is the indicator function.

Lemma 52. Let $m' \in \mathrm{core}_\delta(P)$. Let $m$ be a Borel set such that $m = m'$. Then $m \in
\mathrm{core}_\delta(P)$.

Proof Since $m' \in \mathrm{core}_\delta(P)$, $P(m') = \delta$ and there exists a minimum volume set $b'$ of measure
$1 - \delta$, such that $m' \cap b' = \emptyset$. Let $b = b' \setminus m$. We note that $\lambda(m \,\triangle\, m') = 0$. Therefore,
$b = b' \setminus m = b' \setminus m' = b'$. Therefore, $b$ is a minimum volume set of measure $1 - \delta$. Since
$m \cap b = \emptyset$ and $P(m) = \delta$, $m \in \mathrm{core}_\delta(P)$. ■



Theorem 30 (LDRS optimality - Continuous Setting). When the learner is restricted 
to hard-decisions and Q satisfies Property Acont w.r.t. P, then LDRS is optimal. 

Proof Assume that the statement is false. Therefore, there must exist some m G coTes{P) 
such that for all r G R*^, X{A^d{r,Iip(^^))) > 0. Let r' G R*^, such that p{r',P) = 6. Define 



\[ r(x) = \begin{cases} 1 & p(x) = 0, \\ r'(x) & \text{otherwise.} \end{cases} \]

Therefore, r £ R*g and A(Ajjd(r, > 0. Let j = {x £ m : r{x) = 0}. Therefore, 

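$P(j) > 0$. Thus, there must exist a set $k$, such that $k \cap m = \emptyset$, $k \subseteq \mathrm{supp}(p)$, $\int_k r(x)\lambda(dx) =
\lambda(k)$, and $P(k) = P(j)$ (otherwise, $\rho(r', P) \ne \delta$). Since $P(k) = P(j) > 0$ and $m \in \mathrm{core}_\delta(P)$,
we have $\lambda(j) \ge \lambda(k)$. We define:

\[ r^*(x) = \begin{cases} 1 & x \in j, \\ 0 & x \in k, \\ r(x) & \text{otherwise.} \end{cases} \]

We note that $\rho(r^*, P) = \rho(r, P) \le \delta$.

Let $Q^* \in \mathcal{Q}$ be such that $\min_Q \rho(r^*, Q) = \rho(r^*, Q^*) = \rho(r, Q^*) + Q^*(j) - Q^*(k)$. Thus, if
$Q^*(j) \ge Q^*(k)$, $\rho(r^*, Q^*) \ge \rho(r, Q^*)$. Otherwise, there exists $Q^{*\prime}$ as in Property Acont and
in particular, by Lemma 48, $Q^*(j) \ge Q^{*\prime}(k)$. Consequently, $\rho(r^*, Q^*) = \rho(r, Q^{*\prime}) + Q^*(j) -
Q^{*\prime}(k) \ge \rho(r, Q^{*\prime})$. Therefore, there always exists $Q \in \mathcal{Q}$ such that $\rho(r^*, Q^*) \ge \rho(r, Q)$
(either $Q = Q^*$ or $Q = Q^{*\prime}$). Therefore, $\min_Q \rho(r^*, Q) \ge \min_Q \rho(r, Q)$, and thus $r^* \in \mathcal{R}^*_\delta$.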
However, A(A]gd(r*, = 0. Contradiction. ■ 